Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud .
While Chapter 7 focuses on one special kind of collection type, namely sequences, this chapter introduces two more kinds: mappings and sets. Both are presented in this chapter as they share the same underlying implementation.
The dict
type (cf., documentation ) introduced in the next section is an essential part of a data scientist's toolbox for two reasons: First, Python employs
dict
objects basically everywhere internally. Second, after the many concepts involving sequential data, mappings provide a different perspective on data and enhance our general problem solving skills.
dict
Type
A mapping relates a set of keys to a set of values, associating each key with exactly one value. In other words, a mapping is a collection of key-value pairs, also called items for short.
In the context of mappings, the term value has a meaning different from the value every object has: In the "bag" analogy from Chapter 1 , we describe an object's value to be the semantic meaning of the 0s and 1s it contains. Here, the terms key and value mean the role an object takes within a mapping. Both keys and values are objects on their own with distinct values.
Let's continue with an example. To create a dict
object, we commonly use the literal notation, {..: .., ..: .., ...}
, and list all the items. to_words
below maps the int
objects 0
, 1
, and 2
to their English word equivalents, "zero"
, "one"
, and "two"
, and from_words
does the opposite. A stylistic side note: Pythonistas often expand dict
or list
definitions by writing each item or element on a line of its own. Also, the commas ,
after the respective last items, 2: "two"
and "two": 2
, are not a mistake although they may be left out. Besides easier reading, such a style has technical advantages that we do not go into detail about here (cf., source ).
to_words = {
0: "zero",
1: "one",
2: "two",
}
from_words = {
"zero": 0,
"one": 1,
"two": 2,
}
As before, dict
objects are objects on their own: They have an identity, a type, and a value.
id(to_words)
139936685526208
type(to_words)
dict
to_words
{0: 'zero', 1: 'one', 2: 'two'}
id(from_words)
139936686018688
type(from_words)
dict
from_words
{'zero': 0, 'one': 1, 'two': 2}
The built-in dict() constructor gives us an alternative way to create a
dict
object. It is versatile and can be used in different ways.
First, we may pass it any mapping type, for example, a dict
object, to obtain a new dict
object. That is the easiest way to obtain a shallow copy of a dict
object or convert any other mapping object into a dict
one.
dict(from_words)
{'zero': 0, 'one': 1, 'two': 2}
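As a quick check, the new object has its own identity but compares equal to the original; the name from_words_copy below is ours for illustration.
from_words_copy = dict(from_words)
from_words_copy is from_words  # a distinct object in memory ...
from_words_copy == from_words  # ... with an equal value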
Second, we may pass it a finite iterable
providing iterables with two elements each. So, both of the following two code cells work: A list
of tuple
objects, or a tuple
of list
objects. More importantly, we could use an iterator, for example, a generator
object, that produces the inner iterables "on the fly."
dict([("zero", 0), ("one", 1), ("two", 2)])
{'zero': 0, 'one': 1, 'two': 2}
dict((["zero", 0], ["one", 1], ["two", 2]))
{'zero': 0, 'one': 1, 'two': 2}
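For example, the built-in zip() function returns an iterator producing 2-element tuple objects "on the fly," which dict() consumes one by one; the names words and numbers below are ours for illustration.
words = ["zero", "one", "two"]
numbers = [0, 1, 2]
dict(zip(words, numbers))  # zip() yields ("zero", 0), ("one", 1), ... lazily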
Lastly, dict() may also be called with keyword arguments: The keywords become the keys and the arguments the values.
dict(zero=0, one=1, two=2)
{'zero': 0, 'one': 1, 'two': 2}
Often, dict
objects occur in a nested form and combined with other collection types, such as list
or tuple
objects, to model more complex entities "from the real world."
The reason for this popularity is that many modern REST APIs on the internet (e.g., Google Maps API, Yelp API, Twilio API) provide their data in the popular JSON
format, which looks almost like a combination of
dict
and list
objects in Python.
The people
example below models three groups of people: "mathematicians"
, "physicists"
, and "programmers"
. Each person may have an arbitrary number of email addresses. In the example, Leonhard Euler did not live long enough to get one, whereas Guido
has more than one.
people
makes many implicit assumptions about the structure of the data. For example, there is a one-to-many relationship between a person and their email addresses and a one-to-one
relationship between each person and their name.
people = {
"mathematicians": [
{
"name": "Gilbert Strang",
"emails": ["gilbert@mit.edu"],
},
{
"name": "Leonhard Euler",
"emails": [],
},
],
"physicists": [],
"programmers": [
{
"name": "Guido",
"emails": ["guido@python.org", "guido@dropbox.com"],
},
],
}
The literal notation of such a nested dict
object may be hard to read ...
people
{'mathematicians': [{'name': 'Gilbert Strang', 'emails': ['gilbert@mit.edu']}, {'name': 'Leonhard Euler', 'emails': []}], 'physicists': [], 'programmers': [{'name': 'Guido', 'emails': ['guido@python.org', 'guido@dropbox.com']}]}
... but the pprint module in the standard library
provides a pprint()
function for "pretty printing."
from pprint import pprint
pprint(people, indent=1, width=60)
{'mathematicians': [{'emails': ['gilbert@mit.edu'], 'name': 'Gilbert Strang'}, {'emails': [], 'name': 'Leonhard Euler'}], 'physicists': [], 'programmers': [{'emails': ['guido@python.org', 'guido@dropbox.com'], 'name': 'Guido'}]}
In Chapter 0 , we argue that a major advantage of using Python is that it takes care of the memory management for us. In line with that, we have never talked about the C level implementation thus far in the book. However, the
dict
type, among others, exhibits some behaviors that may seem "weird" for a beginner. To build some intuition, we describe the underlying implementation details on a conceptual level.
The first unintuitive behavior is that we may not use a mutable object as a key. That results in a TypeError
.
{
["zero", "one"]: [0, 1],
}
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[17], line 1 ----> 1 { 2 ["zero", "one"]: [0, 1], 3 } TypeError: unhashable type: 'list'
Similarly surprising is that items with the same key get merged together. The resulting dict
object keeps the position of the first mention of the "zero"
key while only the last mention of the corresponding values, 999
, survives.
{
"zero": 0,
"one": 1,
"two": 2,
"zero": 999, # to illustrate a point
}
{'zero': 999, 'one': 1, 'two': 2}
The reason for that is that the dict
type is implemented with so-called hash tables .
Conceptually, when we create a new dict
object, Python creates a "bag" in memory that takes significantly more space than needed to store the references to all the key and value objects. This bag is a contiguous array similar to the list
type's implementation. Whereas in the list
case the array is divided into equally sized slots capable of holding one reference, a dict
object's array is divided into equally sized buckets with enough space to store two references each: One for an item's key and one for the mapped value. The buckets are labeled with index numbers. Because Python knows how wide each bucket is, it can jump directly into any bucket by calculating its offset from the start.
The figure below visualizes how we should think of hash tables. An empty dict
object, created with the literal {}
, still takes a lot of memory: It is essentially one big, contiguous, and empty table.
Bucket | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
Key | ... | ... | ... | ... | ... | ... | ... | ... |
Value | ... | ... | ... | ... | ... | ... | ... | ... |
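To make the figure a bit more concrete, we may model such an empty table in pure Python, for example, as a list object with eight buckets; this is only a conceptual sketch and not how CPython actually lays out the memory.
# a conceptual model only: 8 empty buckets, each with room for a key and a value
table = [[None, None] for _ in range(8)]
# as every bucket has the same width, Python can jump straight to any bucket
# by calculating its offset (conceptually: index * bucket_width)
table[5]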
To insert a key-value pair, the key must be translated into a bucket's index.
As the first step to do so, the built-in hash() function maps any hashable object to its hash value, a long and "random"
int
number, similar to the ones returned by the built-in id() function. This hash value is a summary of all the 0s and 1s inside the object.
According to the official glossary , an object is hashable only if "it has a hash value which never changes during its lifetime." So, hashability implies immutability! Without this formal requirement an object may end up in different buckets depending on its current value. As the name of the
dict
type (i.e., "dictionary") suggests, a primary purpose of it is to insert objects and look them up later on. Without a unique bucket, this is of course not doable. The exact logic behind hash() is beyond the scope of this book.
Let's calculate the hash value of "zero"
, an immutable str
object. Hash values have no semantic meaning. Also, every time we re-start Python, we see different hash values for the same objects. That is a security measure, and we do not go into the technicalities here (cf. source ).
hash("zero")
-85344695604937002
For numeric objects, we can sometimes predict the hash values. However, we must never interpret any meaning into them.
hash(0)
0
hash(0.0)
0
The glossary states a second requirement for hashability, namely that "objects which compare equal must have the same hash value." The purpose of this is to ensure that if we put, for example,
1
as a key in a dict
object, we can look it up later with 1.0
. In other words, we can look up keys by their object's semantic value. The converse statement does not hold: Two objects may (accidentally) have the same hash value and not compare equal. However, that rarely happens.
1 == 1.0
True
hash(1) == hash(1.0)
True
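As an aside, a true hash collision is easy to provoke: In CPython, hash(-1) evaluates to -2 because -1 is reserved internally as an error indicator; other Python implementations may behave differently.
-1 == -2              # two objects that do not compare equal ...
hash(-1) == hash(-2)  # ... may still share a hash value (a CPython detail)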
Because list
objects are not immutable, they are never hashable, as indicated by the TypeError
.
hash(["zero", "one"])
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[24], line 1 ----> 1 hash(["zero", "one"]) TypeError: unhashable type: 'list'
If we need keys composed of several objects, we can use tuple
objects instead.
hash(("zero", "one"))
-1616807732336770172
There is no such restriction on objects inserted into dict
objects as values.
{
("zero", "one"): [0, 1],
}
{('zero', 'one'): [0, 1]}
After obtaining the key object's hash value, Python must still convert that into a bucket index. We do not cover this step in technical detail but provide a conceptual description of it.
The buckets()
function below shows how we can obtain indexes from the binary representation of a hash value by simply extracting its least significant bits
and interpreting them as index numbers. Alternatively, the hash value may also be divided with the %
operator by the number of available buckets. We show this idea in the buckets_alt()
function that takes the number of buckets, n_buckets
, as its second argument.
def buckets(mapping, *, bits):
"""Calculate the bucket indices for a mapping's keys."""
for key in mapping: # cf., next section for details on looping
hash_value = hash(key)
binary = bin(hash_value)
address = binary[-bits:]
bucket = int("0b" + address, base=2)
print(key, hash_value, "0b..." + binary[-8:], address, bucket, sep="\t")
def buckets_alt(mapping, *, n_buckets):
"""Calculate the bucket indices for a mapping's keys."""
for key in mapping: # cf., next section for details on looping
hash_value = hash(key)
bucket = hash_value % n_buckets
print(key, hash_value, bucket, sep="\t")
With an infinite number of possible keys being mapped to a limited number of buckets, there is a realistic chance that two or more keys end up in the same bucket. That is called a hash collision. In such cases, Python uses a perturbation rule to rearrange the bits and derive another bucket index; if that bucket is empty, the item is placed there. Then, the nice offsetting logic from above breaks down, and Python needs more time on average to place items into a hash table or look them up. The remedy is to use a bigger hash table as the chance of collisions then decreases. Python does all that for us in the background, and the main cost we pay for that is the higher memory usage of dict
objects in general.
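To get a rough feeling for this cost, we may compare the memory footprint of a dict object with that of a list object referencing the same number of elements; the exact byte counts depend on the Python version and are only illustrative.
import sys

sys.getsizeof(list(range(1_000)))                     # a list with 1,000 references
sys.getsizeof(dict((x, None) for x in range(1_000)))  # a dict with 1,000 items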
Because keys with the same semantic value have the same hash value, they end up in the same bucket. That is why the item that gets inserted last overwrites all previously inserted items whose keys compare equal, as we saw with the two "zero"
keys above.
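For example, the int object 1, the bool object True, and the float object 1.0 all compare equal and share the same hash value, so they land in the same bucket and only one item survives: the first key together with the last value.
{1: "int", True: "bool", 1.0: "float"}  # evaluates to {1: 'float'}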
Thus, to come up with indexes for 4 buckets, we need to extract 2 bits from the hash value (i.e., 2² = 4).
buckets(from_words, bits=2)
zero -85344695604937002 0b...00101010 10 2 one 6414592332130781825 0b...10000001 01 1 two 4316247523642253857 0b...00100001 01 1
buckets_alt(from_words, n_buckets=4)
zero -85344695604937002 2 one 6414592332130781825 1 two 4316247523642253857 1
Similarly, 3 bits provide indexes for 8 buckets (i.e., 2³ = 8) ...
buckets(from_words, bits=3)
zero -85344695604937002 0b...00101010 010 2 one 6414592332130781825 0b...10000001 001 1 two 4316247523642253857 0b...00100001 001 1
buckets_alt(from_words, n_buckets=8)
zero -85344695604937002 6 one 6414592332130781825 1 two 4316247523642253857 1
... while 4 bits do so for 16 buckets (i.e., 2⁴ = 16).
buckets(from_words, bits=4)
zero -85344695604937002 0b...00101010 1010 10 one 6414592332130781825 0b...10000001 0001 1 two 4316247523642253857 0b...00100001 0001 1
buckets_alt(from_words, n_buckets=16)
zero -85344695604937002 6 one 6414592332130781825 1 two 4316247523642253857 1
Python allocates the memory for a dict
object's hash table according to some internal heuristics: Whenever a hash table is roughly 2/3 full, it creates a new one with twice the space, and re-inserts all items, one by one, from the old one. So, during its lifetime, a dict
object may have several hash tables.
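We may observe these re-allocations indirectly with the sys.getsizeof() function: As items are inserted one by one, the reported size stays constant for a while and then jumps whenever Python switches to a bigger hash table. The exact numbers and jump points depend on the Python version, and the name growing is ours for illustration.
import sys

growing = {}
for i in range(10):
    growing[i] = None
    print(len(growing), sys.getsizeof(growing), sep="\t")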
Although hash tables seem quite complex at first sight, they help us to make certain operations very fast as we see further below.
In Chapter 7 , we see how a sequence is a special kind of a collection, and that collections can be described as finite (i.e., Sized), iterable (i.e., Iterable), and container-like (i.e., Container) objects.
The dict
type is another collection type and has these three properties as well.
For example, we may pass to_words
or from_words
to the built-in len() function to obtain the number of items they contain. In the terminology of the collections.abc
module in the standard library
, both are
Sized
objects.
len(to_words)
3
len(from_words)
3
Also, dict
objects may be looped over, for example, with the for
statement. So, in the terminology of the collections.abc module, they are
Iterable
objects.
Regarding the iteration order, things are not that easy, and programmers often seem to be confused about this (e.g., this discussion). The confusion usually comes from one of two reasons:
First, the behavior of the dict type has been changed over the last couple of minor release versions, and the communication thereof in the official release notes was done only in a later version. In a nutshell, before Python 3.6, the core developers did not care about the iteration order at all as the goal was to optimize dict objects for computational speed, primarily regarding key look-up (cf., the "Indexing -> Key Look-up" section below). That meant that looping over the same dict object several times during its lifetime could have resulted in different iteration orders. In Python 3.6, it was discovered that it is possible to make dict objects remember the order in which items have been inserted without giving up any computational speed or memory (cf., Raymond Hettinger's talk in the Further Resources).
Second, even though the insertion order is remembered for Python 3.6 and 3.7, dict objects are not Reversible as specified by the Reversible ABC in the collections.abc module. That was then changed in Python 3.8, but again not officially communicated (cf., the Python 3.8 release notes).
In summary, we can say that depending on the exact Python version, a dict object may remember the insertion order of its items.
However, that order is only apparent to us (i.e., we could look it up) if we put the data stored in a dict
object into the source code itself. Then, we say that we "hard code" the data in our program. That is often not useful as we want our software to load the data to be processed, for example, from a file or a database.
Therefore, we suggest and adopt the following best practices in this book: First, we always assume that the items in a dict object are not in a predictable order and never make the correctness of the logic in our code dependent on it. Second, whenever we need dict-like objects with an explicit order, we use the OrderedDict type from the collections module in the standard library.
If you installed Python, as recommended, via the Anaconda Distribution, the order in the two for-loops below is the same as in the source code that defines to_words and from_words above. In that sense, it is predictable.
!python --version # the order in the for-loops is predictable only for Python 3.7 or higher
Python 3.12.2
By convention, iteration goes over the keys in the dict
object only. The "Dictionary Methods" section below shows how to loop over the items or the values instead.
for number in to_words:
print(number)
0 1 2
for word in from_words:
print(word)
zero one two
For Python 3.8 and higher, dict
objects are Reversible
as well. So, passing a dict
object to the reversed() built-in works. However, for earlier Python versions, the next two cells raise a
TypeError
.
for number in reversed(to_words):
print(number)
2 1 0
for word in reversed(from_words):
print(word)
two one zero
Of course, we may always use the built-in sorted() function to loop over, for example,
from_words
in a predictable order. However, that creates a temporary list
object in memory and an order that has nothing to do with how the items are ordered inside the dict
object.
for word in sorted(from_words):
print(word)
one two zero
To show the Container
behavior of collection types, we use the boolean in
operator to check if a given object evaluates equal to a key in to_words
or from_words
.
1.0 in to_words # 1.0 is not a key but compares equal to a key
True
-1 in to_words
False
"one" in from_words
True
"ten" in from_words
False
list
vs. dict
Because of the hash table implementation, the
in
operator is extremely fast: Python does not need to initiate a linear search as in the
list
case but immediately knows the only places in memory where the searched object must be located if present in the hash table at all. Then, the Python interpreter jumps right there in only one step. Because that is true no matter how many items are in the hash table, we call that a constant time operation.
Conceptually, the overall behavior of the in
operator is like comparing the searched object against all key objects with the ==
operator, without actually performing all those comparisons.
To show the speed, we run an experiment. We create a haystack
, a list
object, with 10_000_001
elements in it, one of which is the needle
, namely 42
. Once again, the randint() function in the random
module is helpful.
import random
random.seed(87)
needle = 42
haystack = [random.randint(99, 9999) for _ in range(10_000_000)]
haystack.append(needle)
random.shuffle(haystack)
haystack[:10]
[8126, 7370, 3735, 213, 7922, 1434, 8557, 9609, 9704, 9564]
haystack[-10:]
[7237, 886, 5945, 4014, 4998, 2055, 3531, 6919, 7875, 1944]
As modern computers are generally fast, we search the haystack
a total of 10
times.
%%timeit -n 1 -r 1
for _ in range(10):
needle in haystack
4.44 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Now, we convert the elements of the haystack
into the keys of a magic_haystack
, a dict
object. We use None
as a dummy value for all items.
magic_haystack = dict((x, None) for x in haystack)
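As an aside, the dict.fromkeys() classmethod builds the same kind of mapping more concisely: It maps every element of an iterable to one common value, which defaults to None; the name magic_haystack_alt is ours for illustration.
magic_haystack_alt = dict.fromkeys(haystack)  # every key maps to None
magic_haystack_alt == magic_haystack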
To show the massive effect of the hash table implementation, we search the magic_haystack
not 10
but 10_000_000
times. The code cell still runs in only a fraction of the time its counterpart does above.
%%timeit -n 1 -r 1
for _ in range(10_000_000):
needle in magic_haystack
560 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
However, there is no fast way to look up the values the keys are mapped to. To achieve that, we have to loop over all items and check for each value object if it compares equal to the searched object. That is, by definition, a linear search, as well, and rather slow. In the context of dict
objects, we call that a reverse look-up.
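A reverse look-up could be sketched as below; the reverse_lookup() function is our own illustration (using the key look-up operator [] introduced next) and not a feature of the dict type.
def reverse_lookup(mapping, search_value):
    """Find the first key that maps to search_value via a linear search."""
    for key in mapping:
        if mapping[key] == search_value:
            return key
    return None  # nothing found

reverse_lookup(from_words, 2)  # returns "two"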
The same efficient key look-up executed in the background with the in
operator is also behind the indexing operator []
. Instead of returning either True
or False
, it returns the value object the looked up key maps to.
To show the similarity to indexing into list
objects, we provide another example with to_words_list
.
to_words_list = ["zero", "one", "two"]
Without the above definitions, we could not tell the difference between to_words
and to_words_list
: The usage of the []
operator is the same.
to_words[0]
'zero'
to_words_list[0]
'zero'
Because key objects can be of any immutable type and are, in particular, not constrained to just the int
type, the word "indexing" is an understatement here. Therefore, in the context of dict
objects, we view the []
operator as a generalization of the indexing operator and refer to it as the (key) look-up operator.
from_words["two"]
2
If a key is not in a dict
object, Python raises a KeyError
. A sequence type would raise an IndexError
in this situation.
from_words["drei"]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[61], line 1 ----> 1 from_words["drei"] KeyError: 'drei'
While dict
objects support the []
operator to look up a single key, the more general concept of slicing is not available. That is in line with the idea that there is no predictable order associated with a dict
object's keys, and slicing requires an order.
To access deeper levels in nested data, like people
, we chain the look-up operator []
. For example, let's view all the "mathematicians"
in people
.
people["mathematicians"]
[{'name': 'Gilbert Strang', 'emails': ['gilbert@mit.edu']}, {'name': 'Leonhard Euler', 'emails': []}]
Let's take the first mathematician on the list, ...
people["mathematicians"][0]
{'name': 'Gilbert Strang', 'emails': ['gilbert@mit.edu']}
... and output his "name"
...
people["mathematicians"][0]["name"]
'Gilbert Strang'
... or his "emails"
.
people["mathematicians"][0]["emails"]
['gilbert@mit.edu']
We may mutate dict
objects in place.
For example, let's translate the English words in to_words
to their German counterparts. Behind the scenes, Python determines the bucket of the objects passed to the []
operator, looks them up in the hash table, and, if present, updates the references to the mapped value objects.
to_words
{0: 'zero', 1: 'one', 2: 'two'}
to_words[0] = "null"
to_words[1] = "eins"
to_words[2] = "zwei"
to_words
{0: 'null', 1: 'eins', 2: 'zwei'}
Let's add two more items. Again, Python determines their buckets, but this time finds them to be empty, and inserts the references to their key and value objects.
to_words[3] = "drei"
to_words[4] = "vier"
to_words
{0: 'null', 1: 'eins', 2: 'zwei', 3: 'drei', 4: 'vier'}
None of these operations change the identity of the to_words
object.
id(to_words) # same memory location as before
139936685526208
The del
statement removes individual items. Python just removes the two references to the key and value objects in the corresponding bucket.
del to_words[0]
to_words
{1: 'eins', 2: 'zwei', 3: 'drei', 4: 'vier'}
We may also change parts of nested data, such as people
.
For example, let's add Albert Einstein to the list of
"physicists"
, ...
people["physicists"]
[]
people["physicists"].append({"name": "Albert Einstein"})
... complete Guido's "name"
, ...
people["programmers"][0]
{'name': 'Guido', 'emails': ['guido@python.org', 'guido@dropbox.com']}
people["programmers"][0]["name"] = "Guido van Rossum"
... and remove his work email because he retired.
del people["programmers"][0]["emails"][1]
Now, people
looks like this.
pprint(people, indent=1, width=60)
{'mathematicians': [{'emails': ['gilbert@mit.edu'], 'name': 'Gilbert Strang'}, {'emails': [], 'name': 'Leonhard Euler'}], 'physicists': [{'name': 'Albert Einstein'}], 'programmers': [{'emails': ['guido@python.org'], 'name': 'Guido van Rossum'}]}
dict
Methods
dict
objects come with many methods bound on them (cf., documentation ), many of which are standardized by the
Mapping
and MutableMapping
ABCs from the collections.abc module. While the former requires the .keys()
, .values()
, .items()
, and .get()
methods, which never mutate an object, the latter formalizes the .update()
, .pop()
, .popitem()
, .clear()
, and .setdefault()
methods, which may do so.
import collections.abc as abc
isinstance(from_words, abc.Mapping)
True
isinstance(from_words, abc.MutableMapping)
True
While iteration over a mapping type already goes over its keys, we may emphasize this explicitly by adding the .keys() method in the
for
-loop. Again, the iteration order is equivalent to the insertion order but still considered unpredictable.
for word in from_words.keys():
print(word)
zero one two
.keys() returns an object of type
dict_keys
. That is a dynamic view into from_words
's hash table, which means it does not copy the references to the keys, and changes to from_words
can be seen through it. View objects behave much like dict
objects themselves.
from_words.keys()
dict_keys(['zero', 'one', 'two'])
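To see that the view is indeed dynamic, we may mutate from_words after creating a view and observe the change through it; the name words_view is ours for illustration, and we remove the temporary key right away so that the examples below are unaffected.
words_view = from_words.keys()
from_words["three"] = 3   # mutate the dict after the view was created ...
"three" in words_view     # ... and the new key is visible through the view
del from_words["three"]   # undo the change so the cells below stay as before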
Views can be materialized with the list() built-in. However, that may introduce semantic errors into a program as the newly created
list
object has a "predictable" order (i.e., indexes 0
, 1
, ...) created from an unpredictable one.
list(from_words.keys())
['zero', 'one', 'two']
To loop over the value objects instead, we use the .values() method. That returns a view (i.e., type
dict_values
) on the value objects inside from_words
without copying them.
for number in from_words.values():
print(number)
0 1 2
from_words.values()
dict_values([0, 1, 2])
To loop over key-value pairs, we invoke the .items() method. That returns a view (i.e., type
dict_items
) on the key-value pairs as tuple
objects, where the first element is the key and the second the value. Because of that, we use tuple unpacking in the for
-loop.
for word, number in from_words.items():
print(f"{word} -> {number}")
zero -> 0 one -> 1 two -> 2
from_words.items()
dict_items([('zero', 0), ('one', 1), ('two', 2)])
Above, we see how the look-up operator fails loudly with a KeyError
if a key is not in a dict
object. For example, to_words
does not have a key 0
any more.
to_words
{1: 'eins', 2: 'zwei', 3: 'drei', 4: 'vier'}
to_words[0]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[91], line 1 ----> 1 to_words[0] KeyError: 0
That may be mitigated with the .get() method that takes two arguments:
key
and default
. It returns the value object key
maps to if it is in the dict
object; otherwise, default
is returned. If not provided, default
is None
.
to_words.get(0, "n/a")
'n/a'
to_words.get(1, "n/a")
'eins'
The .update() method merges the items of another mapping (or an iterable of key-value pairs) into the dict object it is invoked on: The values of already existing keys are overwritten, while new keys are added.
to_spanish = {
0: "cero",
1: "uno",
2: "dos",
3: "tres",
4: "cuatro",
5: "cinco",
}
to_words.update(to_spanish)
to_words
{1: 'uno', 2: 'dos', 3: 'tres', 4: 'cuatro', 0: 'cero', 5: 'cinco'}
In contrast to the pop()
method of the list
type, the .pop() method of the
dict
type requires a key
argument to be passed. Then, it removes the corresponding key-value pair and returns the value object. If the key
is not in the dict
object, a KeyError
is raised.
from_words
{'zero': 0, 'one': 1, 'two': 2}
number = from_words.pop("zero")
number
0
from_words
{'one': 1, 'two': 2}
With an optional default
argument, the loud KeyError
may be suppressed and the default
returned instead, just as with the .get() method above.
from_words.pop("zero")
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[101], line 1 ----> 1 from_words.pop("zero") KeyError: 'zero'
from_words.pop("zero", 0)
0
Similar to the pop()
method of the list
type, the .popitem() method of the
dict
type removes and returns an "arbitrary" key-value pair as a tuple
object from a dict
object. With the preservation of the insertion order in Python 3.7 and higher, this effectively becomes a "last in, first out" rule, just as with the list
type. Once a dict
object is empty, .popitem() raises a
KeyError
.
word, number = from_words.popitem()
word, number
('two', 2)
from_words
{'one': 1}
The .clear() method removes all items but keeps the
dict
object alive in memory.
to_words.clear()
to_words
{}
from_words.clear()
from_words
{}
The .setdefault() method may have a bit of an unfortunate name but is useful, in particular, with nested
list
objects. It takes two arguments, key
and default
, and returns the value mapped to key
if key
is in the dict
object; otherwise, it inserts the key
-default
pair and returns a reference to the newly created value object. So, it is similar to the .get() method above but also mutates the
dict
object.
Consider the people
example again and note how the dict
object modeling "Albert Einstein"
has no "emails"
key in it.
pprint(people, indent=1, width=60)
{'mathematicians': [{'emails': ['gilbert@mit.edu'], 'name': 'Gilbert Strang'}, {'emails': [], 'name': 'Leonhard Euler'}], 'physicists': [{'name': 'Albert Einstein'}], 'programmers': [{'emails': ['guido@python.org'], 'name': 'Guido van Rossum'}]}
Let's say we want to append the imaginary emails "leonhard@math.org"
and "albert@physics.org"
. We cannot be sure if a dict
object modeling a person already has an "emails"
key or not. To play it safe, we could first use the in
operator to check for that and create a new list
object in a second step if one is missing. Then, we would finally append the new email.
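Such a verbose version could look like the sketch below; the add_email() function is our own illustration, shown for comparison only and not applied to people here.
def add_email(person, address):
    """Append an email address, creating the "emails" list first if necessary."""
    if "emails" not in person:        # 1st look-up: check for the key
        person["emails"] = []         # 2nd look-up: insert an empty list
    person["emails"].append(address)  # 3rd look-up: fetch the list and append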
.setdefault() allows us to do all of the three steps at once. More importantly, behind the scenes Python only needs to make one key look-up instead of potentially three. For large nested data, that could speed up the computations significantly.
So, the first code cell below adds the email to the already existing empty list
object, while the second one creates a new one first.
people["mathematicians"][1].setdefault("emails", []).append("leonhard@math.org")
people["physicists"][0].setdefault("emails", []).append("albert@physics.org")
pprint(people, indent=1, width=60)
{'mathematicians': [{'emails': ['gilbert@mit.edu'], 'name': 'Gilbert Strang'}, {'emails': ['leonhard@math.org'], 'name': 'Leonhard Euler'}], 'physicists': [{'emails': ['albert@physics.org'], 'name': 'Albert Einstein'}], 'programmers': [{'emails': ['guido@python.org'], 'name': 'Guido van Rossum'}]}
dict
objects also come with a copy() method on them that creates shallow copies.
guido = people["programmers"][0].copy()
guido
{'name': 'Guido van Rossum', 'emails': ['guido@python.org']}
If we mutate guido
and, for example, remove all his emails with the .clear()
method on the list
type, these changes are also visible through people
.
guido["emails"].clear()
guido
{'name': 'Guido van Rossum', 'emails': []}
pprint(people, indent=1, width=60)
{'mathematicians': [{'emails': ['gilbert@mit.edu'], 'name': 'Gilbert Strang'}, {'emails': ['leonhard@math.org'], 'name': 'Leonhard Euler'}], 'physicists': [{'emails': ['albert@physics.org'], 'name': 'Albert Einstein'}], 'programmers': [{'emails': [], 'name': 'Guido van Rossum'}]}
dict
Comprehensions
Analogous to list
comprehensions in Chapter 8 ,
dict
comprehensions are a concise literal notation to derive new dict
objects out of existing ones.
For example, let's derive from_words
out of to_words
below by swapping the keys and values.
to_words = {
0: "zero",
1: "one",
2: "two",
}
Without a dictionary comprehension, we would have to initialize an empty dict
object, loop over the items of the original one, and insert the key-value pairs one by one in a reversed fashion as value-key pairs. That assumes that the values are unique as otherwise some would be merged.
from_words = {}
for number, word in to_words.items():
from_words[word] = number
from_words
{'zero': 0, 'one': 1, 'two': 2}
While that code is correct, it is also unnecessarily verbose. The dictionary comprehension below works in the same way as list comprehensions except that curly braces {}
replace the brackets []
and a colon :
is added to separate the keys from the values.
{v: k for k, v in to_words.items()}
{'zero': 0, 'one': 1, 'two': 2}
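As a quick aside, if the values are not unique, the swapped items get merged just like duplicate keys in a literal: Only the last one survives.
{v: k for k, v in {0: "zero", 1: "one", 2: "one"}.items()}  # -> {'zero': 0, 'one': 2}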
We may filter out items with an if
-clause and transform the remaining key and value objects.
For no good reason, let's filter out all words starting with a "z"
and uppercase the remaining words.
{v.upper(): k for k, v in to_words.items() if not v.startswith("z")}
{'ONE': 1, 'TWO': 2}
Multiple for
- and/or if
-clauses are allowed.
For example, let's find all pairs of two numbers from 1
through 10
whose product is "close" to 50
(e.g., within a delta of 5
).
{
(x, y): x * y
for x in range(1, 11) for y in range(1, 11)
if abs(x * y - 50) <= 5
}
{(5, 9): 45, (5, 10): 50, (6, 8): 48, (6, 9): 54, (7, 7): 49, (8, 6): 48, (9, 5): 45, (9, 6): 54, (10, 5): 50}