Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud .
While Chapter 7 focuses on one special kind of collection type, namely sequences, this chapter introduces two more kinds: mappings and sets. Both are presented in this chapter as they share the same underlying implementation.
The dict
type (cf., documentation ) introduced in the next section is an essential part of a data scientist's toolbox for two reasons: First, Python employs
dict
objects basically everywhere internally. Second, after the many concepts involving sequential data, mappings provide a different perspective on data and enhance our general problem solving skills.
dict
Type
A mapping relates a set of keys to a set of values, associating each key with exactly one value. In other words, a mapping is a collection of key-value pairs, also called items for short.
In the context of mappings, the term value has a meaning different from the value every object has: In the "bag" analogy from Chapter 1 , we describe an object's value to be the semantic meaning of the 0s and 1s it contains. Here, the terms key and value mean the role an object takes within a mapping. Both keys and values are objects on their own with distinct values.
Let's continue with an example. To create a dict
object, we commonly use the literal notation, {..: .., ..: .., ...}
, and list all the items. to_words
below maps the int
objects 0
, 1
, and 2
to their English word equivalents, "zero"
, "one"
, and "two"
, and from_words
does the opposite. A stylistic side note: Pythonistas often expand dict
or list
definitions by writing each item or element on a line of its own. Also, the commas ,
after the respective last items, 2: "two"
and "two": 2
, are not a mistake although they may be left out. Besides easier reading, such a style has technical advantages that we do not go into detail about here (cf., source ).
to_words = {
0: "zero",
1: "one",
2: "two",
}
from_words = {
"zero": 0,
"one": 1,
"two": 2,
}
As before, dict
objects are objects on their own: They have an identity, a type, and a value.
id(to_words)
139936685526208
type(to_words)
dict
to_words
{0: 'zero', 1: 'one', 2: 'two'}
id(from_words)
139936686018688
type(from_words)
dict
from_words
{'zero': 0, 'one': 1, 'two': 2}
The built-in dict() constructor gives us an alternative way to create a
dict
object. It is versatile and can be used in different ways.
First, we may pass it any mapping type, for example, a dict
object, to obtain a new dict
object. That is the easiest way to obtain a shallow copy of a dict
object or convert any other mapping object into a dict
one.
dict(from_words)
{'zero': 0, 'one': 1, 'two': 2}
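As a quick check, the new object has its own identity but compares equal to the original; the name from_words_copy below is ours for illustration.
from_words_copy = dict(from_words)
from_words_copy is from_words  # a distinct object in memory ...
from_words_copy == from_words  # ... with an equal value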
Second, we may pass it a finite iterable
providing iterables with two elements each. So, both of the following two code cells work: A list
of tuple
objects, or a tuple
of list
objects. More importantly, we could use an iterator, for example, a generator
object, that produces the inner iterables "on the fly."
dict([("zero", 0), ("one", 1), ("two", 2)])
{'zero': 0, 'one': 1, 'two': 2}
dict((["zero", 0], ["one", 1], ["two", 2]))
{'zero': 0, 'one': 1, 'two': 2}
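For example, the built-in zip() function returns an iterator producing 2-element tuple objects "on the fly," which dict() consumes one by one; the names words and numbers below are ours for illustration.
words = ["zero", "one", "two"]
numbers = [0, 1, 2]
dict(zip(words, numbers))  # zip() yields ("zero", 0), ("one", 1), ... lazily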
Lastly, dict() may also be called with keyword arguments: The keywords become the keys and the arguments the values.
dict(zero=0, one=1, two=2)
{'zero': 0, 'one': 1, 'two': 2}
Often, dict
objects occur in a nested form and combined with other collection types, such as list
or tuple
objects, to model more complex entities "from the real world."
The reason for this popularity is that many modern REST APIs on the internet (e.g., Google Maps API, Yelp API, Twilio API) provide their data in the popular JSON
format, which looks almost like a combination of
dict
and list
objects in Python.
The people
example below models three groups of people: "mathematicians"
, "physicists"
, and "programmers"
. Each person may have an arbitrary number of email addresses. In the example, Leonhard Euler did not live long enough to get one, whereas Guido
has more than one.
people
makes many implicit assumptions about the structure of the data. For example, there is a one-to-many relationship between a person and their email addresses and a one-to-one
relationship between each person and their name.
people = {
"mathematicians": [
{
"name": "Gilbert Strang",
"emails": ["gilbert@mit.edu"],
},
{
"name": "Leonhard Euler",
"emails": [],
},
],
"physicists": [],
"programmers": [
{
"name": "Guido",
"emails": ["guido@python.org", "guido@dropbox.com"],
},
],
}
The literal notation of such a nested dict
object may be hard to read ...
people
{'mathematicians': [{'name': 'Gilbert Strang', 'emails': ['gilbert@mit.edu']}, {'name': 'Leonhard Euler', 'emails': []}], 'physicists': [], 'programmers': [{'name': 'Guido', 'emails': ['guido@python.org', 'guido@dropbox.com']}]}
... but the pprint module in the standard library
provides a pprint()
function for "pretty printing."
from pprint import pprint
pprint(people, indent=1, width=60)
{'mathematicians': [{'emails': ['gilbert@mit.edu'], 'name': 'Gilbert Strang'}, {'emails': [], 'name': 'Leonhard Euler'}], 'physicists': [], 'programmers': [{'emails': ['guido@python.org', 'guido@dropbox.com'], 'name': 'Guido'}]}
In Chapter 0 , we argue that a major advantage of using Python is that it takes care of the memory management for us. In line with that, we have never talked about the C level implementation thus far in the book. However, the
dict
type, among others, exhibits some behaviors that may seem "weird" for a beginner. To build some intuition, we describe the underlying implementation details on a conceptual level.
The first unintuitive behavior is that we may not use a mutable object as a key. That results in a TypeError
.
{
["zero", "one"]: [0, 1],
}
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[17], line 1 ----> 1 { 2 ["zero", "one"]: [0, 1], 3 } TypeError: unhashable type: 'list'
Similarly surprising is that items with the same key get merged together. The resulting dict
object keeps the position of the first mention of the "zero"
key while only the last mention of the corresponding values, 999
, survives.
{
"zero": 0,
"one": 1,
"two": 2,
"zero": 999, # to illustrate a point
}
{'zero': 999, 'one': 1, 'two': 2}
The reason for that is that the dict
type is implemented with so-called hash tables .
Conceptually, when we create a new dict
object, Python creates a "bag" in memory that takes significantly more space than needed to store the references to all the key and value objects. This bag is a contiguous array similar to the list
type's implementation. Whereas in the list
case the array is divided into equally sized slots capable of holding one reference, a dict
object's array is divided into equally sized buckets with enough space to store two references each: One for an item's key and one for the mapped value. The buckets are labeled with index numbers. Because Python knows how wide each bucket is, it can jump directly into any bucket by calculating its offset from the start.
The figure below visualizes how we should think of hash tables. An empty dict
object, created with the literal {}
, still takes a lot of memory: It is essentially one big, contiguous, and empty table.
Bucket | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
Key | ... | ... | ... | ... | ... | ... | ... | ... |
Value | ... | ... | ... | ... | ... | ... | ... | ... |
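To make the figure a bit more concrete, we may model such an empty table in pure Python, for example, as a list object with eight buckets; this is only a conceptual sketch and not how CPython actually lays out the memory.
# a conceptual model only: 8 empty buckets, each with room for a key and a value
table = [[None, None] for _ in range(8)]
# as every bucket has the same width, Python can jump straight to any bucket
# by calculating its offset (conceptually: index * bucket_width)
table[5]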
To insert a key-value pair, the key must be translated into a bucket's index.
As the first step to do so, the built-in hash() function maps any hashable object to its hash value, a long and "random"
int
number, similar to the ones returned by the built-in id() function. This hash value is a summary of all the 0s and 1s inside the object.
According to the official glossary , an object is hashable only if "it has a hash value which never changes during its lifetime." So, hashability implies immutability! Without this formal requirement an object may end up in different buckets depending on its current value. As the name of the
dict
type (i.e., "dictionary") suggests, a primary purpose of it is to insert objects and look them up later on. Without a unique bucket, this is of course not doable. The exact logic behind hash() is beyond the scope of this book.
Let's calculate the hash value of "zero"
, an immutable str
object. Hash values have no semantic meaning. Also, every time we re-start Python, we see different hash values for the same objects. That is a security measure, and we do not go into the technicalities here (cf. source ).
hash("zero")
-85344695604937002
For numeric objects, we can sometimes predict the hash values. However, we must never interpret any meaning into them.
hash(0)
0
hash(0.0)
0
The glossary states a second requirement for hashability, namely that "objects which compare equal must have the same hash value." The purpose of this is to ensure that if we put, for example,
1
as a key in a dict
object, we can look it up later with 1.0
. In other words, we can look up keys by their object's semantic value. The converse statement does not hold: Two objects may (accidentally) have the same hash value and not compare equal. However, that rarely happens.
1 == 1.0
True
hash(1) == hash(1.0)
True
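As an aside, a true hash collision is easy to provoke: In CPython, hash(-1) evaluates to -2 because -1 is reserved internally as an error indicator; other Python implementations may behave differently.
-1 == -2              # two objects that do not compare equal ...
hash(-1) == hash(-2)  # ... may still share a hash value (a CPython detail)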
Because list
objects are not immutable, they are never hashable, as indicated by the TypeError
.
hash(["zero", "one"])
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[24], line 1 ----> 1 hash(["zero", "one"]) TypeError: unhashable type: 'list'
If we need keys composed of several objects, we can use tuple
objects instead.
hash(("zero", "one"))
-1616807732336770172
There is no such restriction on objects inserted into dict
objects as values.
{
("zero", "one"): [0, 1],
}
{('zero', 'one'): [0, 1]}
After obtaining the key object's hash value, Python must still convert that into a bucket index. We do not cover this step in technical detail but provide a conceptual description of it.
The buckets()
function below shows how we can obtain indexes from the binary representation of a hash value by simply extracting its least significant bits
and interpreting them as index numbers. Alternatively, the hash value may also be divided with the %
operator by the number of available buckets. We show this idea in the buckets_alt()
function that takes the number of buckets, n_buckets
, as its second argument.
def buckets(mapping, *, bits):
"""Calculate the bucket indices for a mapping's keys."""
for key in mapping: # cf., next section for details on looping
hash_value = hash(key)
binary = bin(hash_value)
address = binary[-bits:]
bucket = int("0b" + address, base=2)
print(key, hash_value, "0b..." + binary[-8:], address, bucket, sep="\t")
def buckets_alt(mapping, *, n_buckets):
"""Calculate the bucket indices for a mapping's keys."""
for key in mapping: # cf., next section for details on looping
hash_value = hash(key)
bucket = hash_value % n_buckets
print(key, hash_value, bucket, sep="\t")
With an infinite number of possible keys being mapped to a limited number of buckets, there is a realistic chance that two or more keys end up in the same bucket. That is called a hash collision. In such cases, Python uses a perturbation rule to rearrange the bits and derive another bucket index; if that bucket is empty, the item is placed there. Then, the nice offsetting logic from above breaks down, and Python needs more time on average to place items into a hash table or look them up. The remedy is to use a bigger hash table as the chance of collisions then decreases. Python does all that for us in the background, and the main cost we pay for that is the higher memory usage of dict
objects in general.
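To get a rough feeling for this cost, we may compare the memory footprint of a dict object with that of a list object referencing the same number of elements; the exact byte counts depend on the Python version and are only illustrative.
import sys

sys.getsizeof(list(range(1_000)))                     # a list with 1,000 references
sys.getsizeof(dict((x, None) for x in range(1_000)))  # a dict with 1,000 items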
Because keys with the same semantic value have the same hash value, they end up in the same bucket. That is why the item that gets inserted last overwrites all previously inserted items whose keys compare equal, as we saw with the two "zero"
keys above.
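For example, the int object 1, the bool object True, and the float object 1.0 all compare equal and share the same hash value, so they land in the same bucket and only one item survives: the first key together with the last value.
{1: "int", True: "bool", 1.0: "float"}  # evaluates to {1: 'float'}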
Thus, to come up with indexes for 4 buckets, we need to extract 2 bits from the hash value (i.e., 2² = 4).
buckets(from_words, bits=2)
zero -85344695604937002 0b...00101010 10 2 one 6414592332130781825 0b...10000001 01 1 two 4316247523642253857 0b...00100001 01 1
buckets_alt(from_words, n_buckets=4)
zero -85344695604937002 2 one 6414592332130781825 1 two 4316247523642253857 1
Similarly, 3 bits provide indexes for 8 buckets (i.e., 2³ = 8) ...
buckets(from_words, bits=3)
zero -85344695604937002 0b...00101010 010 2 one 6414592332130781825 0b...10000001 001 1 two 4316247523642253857 0b...00100001 001 1
buckets_alt(from_words, n_buckets=8)
zero -85344695604937002 6 one 6414592332130781825 1 two 4316247523642253857 1
... while 4 bits do so for 16 buckets (i.e., 2⁴ = 16).
buckets(from_words, bits=4)
zero -85344695604937002 0b...00101010 1010 10 one 6414592332130781825 0b...10000001 0001 1 two 4316247523642253857 0b...00100001 0001 1
buckets_alt(from_words, n_buckets=16)
zero -85344695604937002 6 one 6414592332130781825 1 two 4316247523642253857 1
Python allocates the memory for a dict
object's hash table according to some internal heuristics: Whenever a hash table is roughly 2/3 full, it creates a new one with twice the space, and re-inserts all items, one by one, from the old one. So, during its lifetime, a dict
object may have several hash tables.
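We may observe these re-allocations indirectly with the sys.getsizeof() function: As items are inserted one by one, the reported size stays constant for a while and then jumps whenever Python switches to a bigger hash table. The exact numbers and jump points depend on the Python version, and the name growing is ours for illustration.
import sys

growing = {}
for i in range(10):
    growing[i] = None
    print(len(growing), sys.getsizeof(growing), sep="\t")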
Although hash tables seem quite complex at first sight, they help us to make certain operations very fast as we see further below.
In Chapter 7 , we see how a sequence is a special kind of a collection, and that collections can be described as finite (i.e., Sized), iterable (i.e., Iterable), and container-like (i.e., Container) objects.
The dict
type is another collection type and has these three properties as well.
For example, we may pass to_words
or from_words
to the built-in len() function to obtain the number of items they contain. In the terminology of the collections.abc
module in the standard library
, both are
Sized
objects.
len(to_words)
3
len(from_words)
3
Also, dict
objects may be looped over, for example, with the for
statement. So, in the terminology of the collections.abc module, they are
Iterable
objects.
Regarding the iteration order, things are not that easy, and programmers often seem to be confused about this (e.g., this discussion). The confusion usually comes from one of two reasons:
First, the behavior of the dict type has been changed over the last couple of minor release versions, and the communication thereof in the official release notes was done only in a later version. In a nutshell, before Python 3.6, the core developers did not care about the iteration order at all as the goal was to optimize dict objects for computational speed, primarily regarding key look-up (cf., the "Indexing -> Key Look-up" section below). That meant that looping over the same dict object several times during its lifetime could have resulted in different iteration orders. In Python 3.6, it was discovered that it is possible to make dict objects remember the order in which items have been inserted without giving up any computational speed or memory (cf., Raymond Hettinger's talk in the Further Resources).
Second, even though the insertion order is remembered for Python 3.6 and 3.7, dict objects are not Reversible as specified by the Reversible ABC in the collections.abc module. That was then changed in Python 3.8, but again not officially communicated (cf., the Python 3.8 release notes).
In summary, we can say that depending on the exact Python version, a dict object may remember the insertion order of its items.
However, that order is only apparent to us (i.e., we could look it up) if we put the data stored in a dict
object into the source code itself. Then, we say that we "hard code" the data in our program. That is often not useful as we want our software to load the data to be processed, for example, from a file or a database.
Therefore, we suggest and adopt the following best practices in this book: First, we always assume that the items in a dict object are not in a predictable order and never make the correctness of the logic in our code dependent on it. Second, whenever we need dict-like objects with an explicit order, we use the OrderedDict type from the collections module in the standard library.
If you installed Python, as recommended, via the Anaconda Distribution, the order in the two for-loops below is the same as in the source code that defines to_words and from_words above. In that sense, it is predictable.
!python --version # the order in the for-loops is predictable only for Python 3.7 or higher
Python 3.12.2
By convention, iteration goes over the keys in the dict
object only. The "Dictionary Methods" section below shows how to loop over the items or the values instead.
for number in to_words:
print(number)
0 1 2
for word in from_words:
print(word)
zero one two
For Python 3.8 and higher, dict
objects are Reversible
as well. So, passing a dict
object to the reversed() built-in works. However, for earlier Python versions, the next two cells raise a
TypeError
.
for number in reversed(to_words):
print(number)
2 1 0
for word in reversed(from_words):
print(word)
two one zero
Of course, we may always use the built-in sorted() function to loop over, for example,
from_words
in a predictable order. However, that creates a temporary list
object in memory and an order that has nothing to do with how the items are ordered inside the dict
object.
for word in sorted(from_words):
print(word)
one two zero
To show the Container
behavior of collection types, we use the boolean in
operator to check if a given object evaluates equal to a key in to_words
or from_words
.
1.0 in to_words # 1.0 is not a key but compares equal to a key
True
-1 in to_words
False
"one" in from_words
True
"ten" in from_words
False
list
vs. dict
Because of the hash table implementation, the
in
operator is extremely fast: Python does not need to initiate a linear search as in the
list
case but immediately knows the only places in memory where the searched object must be located if present in the hash table at all. Then, the Python interpreter jumps right there in only one step. Because that is true no matter how many items are in the hash table, we call that a constant time operation.
Conceptually, the overall behavior of the in
operator is like comparing the searched object against all key objects with the ==
operator, without actually performing all those comparisons.
To show the speed, we run an experiment. We create a haystack
, a list
object, with 10_000_001
elements in it, one of which is the needle
, namely 42
. Once again, the randint() function in the random
module is helpful.
import random
random.seed(87)
needle = 42
haystack = [random.randint(99, 9999) for _ in range(10_000_000)]
haystack.append(needle)
random.shuffle(haystack)
haystack[:10]
[8126, 7370, 3735, 213, 7922, 1434, 8557, 9609, 9704, 9564]
haystack[-10:]
[7237, 886, 5945, 4014, 4998, 2055, 3531, 6919, 7875, 1944]
As modern computers are generally fast, we search the haystack
a total of 10
times.
%%timeit -n 1 -r 1
for _ in range(10):
needle in haystack
4.44 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Now, we convert the elements of the haystack
into the keys of a magic_haystack
, a dict
object. We use None
as a dummy value for all items.
magic_haystack = dict((x, None) for x in haystack)
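As an aside, the dict.fromkeys() classmethod builds the same kind of mapping more concisely: It maps every element of an iterable to one common value, which defaults to None; the name magic_haystack_alt is ours for illustration.
magic_haystack_alt = dict.fromkeys(haystack)  # every key maps to None
magic_haystack_alt == magic_haystack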
To show the massive effect of the hash table implementation, we search the magic_haystack
not 10
but 10_000_000
times. The code cell still runs in only a fraction of the time its counterpart does above.
%%timeit -n 1 -r 1
for _ in range(10_000_000):
needle in magic_haystack
560 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
However, there is no fast way to look up the values the keys are mapped to. To achieve that, we have to loop over all items and check for each value object if it compares equal to the searched object. That is, by definition, a linear search, as well, and rather slow. In the context of dict
objects, we call that a reverse look-up.
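A reverse look-up could be sketched as below; the reverse_lookup() function is our own illustration (using the key look-up operator [] introduced next) and not a feature of the dict type.
def reverse_lookup(mapping, search_value):
    """Find the first key that maps to search_value via a linear search."""
    for key in mapping:
        if mapping[key] == search_value:
            return key
    return None  # nothing found

reverse_lookup(from_words, 2)  # returns "two"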
The same efficient key look-up executed in the background with the in
operator is also behind the indexing operator []
. Instead of returning either True
or False
, it returns the value object the looked up key maps to.
To show the similarity to indexing into list
objects, we provide another example with to_words_list
.
to_words_list = ["zero", "one", "two"]
Without the above definitions, we could not tell the difference between to_words
and to_words_list
: The usage of the []
operator is the same.
to_words[0]
'zero'
to_words_list[0]
'zero'
Because key objects can be of any immutable type and are, in particular, not constrained to just the int
type, the word "indexing" is an understatement here. Therefore, in the context of dict
objects, we view the []
operator as a generalization of the indexing operator and refer to it as the (key) look-up operator.
from_words["two"]
2
If a key is not in a dict
object, Python raises a KeyError
. A sequence type would raise an IndexError
in this situation.
from_words["drei"]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[61], line 1 ----> 1 from_words["drei"] KeyError: 'drei'
While dict
objects support the []
operator to look up a single key, the more general concept of slicing is not available. That is in line with the idea that there is no predictable order associated with a dict
object's keys, and slicing requires an order.
To access deeper levels in nested data, like people
, we chain the look-up operator []
. For example, let's view all the "mathematicians"
in people
.
people["mathematicians"]
[{'name': 'Gilbert Strang', 'emails': ['gilbert@mit.edu']}, {'name': 'Leonhard Euler', 'emails': []}]
Let's take the first mathematician on the list, ...
people["mathematicians"][0]
{'name': 'Gilbert Strang', 'emails': ['gilbert@mit.edu']}
... and output his "name"
...
people["mathematicians"][0]["name"]
'Gilbert Strang'
... or his "emails"
.
people["mathematicians"][0]["emails"]
['gilbert@mit.edu']
We may mutate dict
objects in place.
For example, let's translate the English words in to_words
to their German counterparts. Behind the scenes, Python determines the bucket of the objects passed to the []
operator, looks them up in the hash table, and, if present, updates the references to the mapped value objects.
to_words
{0: 'zero', 1: 'one', 2: 'two'}
to_words[0] = "null"
to_words[1] = "eins"
to_words[2] = "zwei"
to_words
{0: 'null', 1: 'eins', 2: 'zwei'}
Let's add two more items. Again, Python determines their buckets, but this time finds them to be empty, and inserts the references to their key and value objects.
to_words[3] = "drei"
to_words[4] = "vier"
to_words
{0: 'null', 1: 'eins', 2: 'zwei', 3: 'drei', 4: 'vier'}
None of these operations change the identity of the to_words
object.
id(to_words) # same memory location as before
139936685526208
The del
statement removes individual items. Python just removes the two references to the key and value objects in the corresponding bucket.
del to_words[0]
to_words
{1: 'eins', 2: 'zwei', 3: 'drei', 4: 'vier'}
We may also change parts of nested data, such as people
.
For example, let's add Albert Einstein to the list of
"physicists"
, ...
people["physicists"]
[]
people["physicists"].append({"name": "Albert Einstein"})
... complete Guido's "name"
, ...
people["programmers"][0]
{'name': 'Guido', 'emails': ['guido@python.org', 'guido@dropbox.com']}
people["programmers"][0]["name"] = "Guido van Rossum"
... and remove his work email because he retired.
del people["programmers"][0]["emails"][1]
Now, people
looks like this.
pprint(people, indent=1, width=60)
{'mathematicians': [{'emails': ['gilbert@mit.edu'], 'name': 'Gilbert Strang'}, {'emails': [], 'name': 'Leonhard Euler'}], 'physicists': [{'name': 'Albert Einstein'}], 'programmers': [{'emails': ['guido@python.org'], 'name': 'Guido van Rossum'}]}
dict
Methods
dict
objects come with many methods bound on them (cf., documentation ), many of which are standardized by the
Mapping
and MutableMapping
ABCs from the collections.abc module. While the former requires the .keys()
, .values()
, .items()
, and .get()
methods, which never mutate an object, the latter formalizes the .update()
, .pop()
, .popitem()
, .clear()
, and .setdefault()
methods, which may do so.
import collections.abc as abc
isinstance(from_words, abc.Mapping)
True
isinstance(from_words, abc.MutableMapping)
True
While iteration over a mapping type already goes over its keys, we may emphasize this explicitly by adding the .keys() method in the
for
-loop. Again, the iteration order is equivalent to the insertion order but still considered unpredictable.
for word in from_words.keys():
print(word)
zero one two
.keys() returns an object of type
dict_keys
. That is a dynamic view into from_words
's hash table, which means it does not copy the references to the keys, and changes to from_words
can be seen through it. View objects behave much like dict
objects themselves.
from_words.keys()
dict_keys(['zero', 'one', 'two'])
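To see that the view is indeed dynamic, we may mutate from_words after creating a view and observe the change through it; the name words_view is ours for illustration, and we remove the temporary key right away so that the examples below are unaffected.
words_view = from_words.keys()
from_words["three"] = 3   # mutate the dict after the view was created ...
"three" in words_view     # ... and the new key is visible through the view
del from_words["three"]   # undo the change so the cells below stay as before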
Views can be materialized with the list() built-in. However, that may introduce semantic errors into a program as the newly created
list
object has a "predictable" order (i.e., indexes 0
, 1
, ...) created from an unpredictable one.
list(from_words.keys())
['zero', 'one', 'two']
To loop over the value objects instead, we use the .values() method. That returns a view (i.e., type
dict_values
) on the value objects inside from_words
without copying them.
for number in from_words.values():
print(number)
0 1 2
from_words.values()
dict_values([0, 1, 2])
To loop over key-value pairs, we invoke the .items() method. That returns a view (i.e., type
dict_items
) on the key-value pairs as tuple
objects, where the first element is the key and the second the value. Because of that, we use tuple unpacking in the for
-loop.
for word, number in from_words.items():
print(f"{word} -> {number}")
zero -> 0 one -> 1 two -> 2
from_words.items()
dict_items([('zero', 0), ('one', 1), ('two', 2)])
Above, we see how the look-up operator fails loudly with a KeyError
if a key is not in a dict
object. For example, to_words
does not have a key 0
any more.
to_words
{1: 'eins', 2: 'zwei', 3: 'drei', 4: 'vier'}
to_words[0]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[91], line 1 ----> 1 to_words[0] KeyError: 0
That may be mitigated with the .get() method that takes two arguments:
key
and default
. It returns the value object key
maps to if it is in the dict
object; otherwise, default
is returned. If not provided, default
is None
.
to_words.get(0, "n/a")
'n/a'
to_words.get(1, "n/a")
'eins'
The .update() method merges the items of another mapping (or an iterable of key-value pairs) into the dict object it is invoked on: The values of already existing keys are overwritten, while new keys are added.
to_spanish = {
0: "cero",
1: "uno",
2: "dos",
3: "tres",
4: "cuatro",
5: "cinco",
}
to_words.update(to_spanish)
to_words
{1: 'uno', 2: 'dos', 3: 'tres', 4: 'cuatro', 0: 'cero', 5: 'cinco'}
In contrast to the pop()
method of the list
type, the .pop() method of the
dict
type requires a key
argument to be passed. Then, it removes the corresponding key-value pair and returns the value object. If the key
is not in the dict
object, a KeyError
is raised.
from_words
{'zero': 0, 'one': 1, 'two': 2}
number = from_words.pop("zero")
number
0
from_words
{'one': 1, 'two': 2}
With an optional default
argument, the loud KeyError
may be suppressed and the default
returned instead, just as with the .get() method above.
from_words.pop("zero")
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[101], line 1 ----> 1 from_words.pop("zero") KeyError: 'zero'
from_words.pop("zero", 0)
0
Similar to the pop()
method of the list
type, the .popitem() method of the
dict
type removes and returns an "arbitrary" key-value pair as a tuple
object from a dict
object. With the preservation of the insertion order in Python 3.7 and higher, this effectively becomes a "last in, first out" rule, just as with the list
type. Once a dict
object is empty, .popitem() raises a
KeyError
.
word, number = from_words.popitem()
word, number
('two', 2)
from_words
{'one': 1}
The .clear() method removes all items but keeps the
dict
object alive in memory.
to_words.clear()
to_words
{}
from_words.clear()
from_words
{}
The .setdefault() method may have a bit of an unfortunate name but is useful, in particular, with nested
list
objects. It takes two arguments, key
and default
, and returns the value mapped to key
if key
is in the dict
object; otherwise, it inserts the key
-default
pair and returns a reference to the newly created value object. So, it is similar to the .get() method above but also mutates the
dict
object.
Consider the people
example again and note how the dict
object modeling "Albert Einstein"
has no "emails"
key in it.
pprint(people, indent=1, width=60)
{'mathematicians': [{'emails': ['gilbert@mit.edu'], 'name': 'Gilbert Strang'}, {'emails': [], 'name': 'Leonhard Euler'}], 'physicists': [{'name': 'Albert Einstein'}], 'programmers': [{'emails': ['guido@python.org'], 'name': 'Guido van Rossum'}]}
Let's say we want to append the imaginary emails "leonhard@math.org"
and "albert@physics.org"
. We cannot be sure if a dict
object modeling a person already has an "emails"
key or not. To play it safe, we could first use the in
operator to check for that and create a new list
object in a second step if one is missing. Then, we would finally append the new email.
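Such a verbose version could look like the sketch below; the add_email() function is our own illustration, shown for comparison only and not applied to people here.
def add_email(person, address):
    """Append an email address, creating the "emails" list first if necessary."""
    if "emails" not in person:        # 1st look-up: check for the key
        person["emails"] = []         # 2nd look-up: insert an empty list
    person["emails"].append(address)  # 3rd look-up: fetch the list and append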
.setdefault() allows us to do all of the three steps at once. More importantly, behind the scenes Python only needs to make one key look-up instead of potentially three. For large nested data, that could speed up the computations significantly.
So, the first code cell below adds the email to the already existing empty list
object, while the second one creates a new one first.
people["mathematicians"][1].setdefault("emails", []).append("leonhard@math.org")
people["physicists"][0].setdefault("emails", []).append("albert@physics.org")
pprint(people, indent=1, width=60)
{'mathematicians': [{'emails': ['gilbert@mit.edu'], 'name': 'Gilbert Strang'}, {'emails': ['leonhard@math.org'], 'name': 'Leonhard Euler'}], 'physicists': [{'emails': ['albert@physics.org'], 'name': 'Albert Einstein'}], 'programmers': [{'emails': ['guido@python.org'], 'name': 'Guido van Rossum'}]}
dict
objects also come with a copy() method on them that creates shallow copies.
guido = people["programmers"][0].copy()
guido
{'name': 'Guido van Rossum', 'emails': ['guido@python.org']}
If we mutate guido
and, for example, remove all his emails with the .clear()
method on the list
type, these changes are also visible through people
.
guido["emails"].clear()
guido
{'name': 'Guido van Rossum', 'emails': []}
pprint(people, indent=1, width=60)
{'mathematicians': [{'emails': ['gilbert@mit.edu'], 'name': 'Gilbert Strang'}, {'emails': ['leonhard@math.org'], 'name': 'Leonhard Euler'}], 'physicists': [{'emails': ['albert@physics.org'], 'name': 'Albert Einstein'}], 'programmers': [{'emails': [], 'name': 'Guido van Rossum'}]}
dict
Comprehensions
Analogous to list
comprehensions in Chapter 8 ,
dict
comprehensions are a concise literal notation to derive new dict
objects out of existing ones.
For example, let's derive from_words
out of to_words
below by swapping the keys and values.
to_words = {
0: "zero",
1: "one",
2: "two",
}
Without a dictionary comprehension, we would have to initialize an empty dict
object, loop over the items of the original one, and insert the key-value pairs one by one in a reversed fashion as value-key pairs. That assumes that the values are unique as otherwise some would be merged.
from_words = {}
for number, word in to_words.items():
from_words[word] = number
from_words
{'zero': 0, 'one': 1, 'two': 2}
While that code is correct, it is also unnecessarily verbose. The dictionary comprehension below works in the same way as list comprehensions except that curly braces {}
replace the brackets []
and a colon :
is added to separate the keys from the values.
{v: k for k, v in to_words.items()}
{'zero': 0, 'one': 1, 'two': 2}
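As a quick aside, if the values are not unique, the swapped items get merged just like duplicate keys in a literal: Only the last one survives.
{v: k for k, v in {0: "zero", 1: "one", 2: "one"}.items()}  # -> {'zero': 0, 'one': 2}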
We may filter out items with an if
-clause and transform the remaining key and value objects.
For no good reason, let's filter out all words starting with a "z"
and uppercase the remaining words.
{v.upper(): k for k, v in to_words.items() if not v.startswith("z")}
{'ONE': 1, 'TWO': 2}
Multiple for
- and/or if
-clauses are allowed.
For example, let's find all pairs of two numbers from 1
through 10
whose product is "close" to 50
(e.g., within a delta of 5
).
{
(x, y): x * y
for x in range(1, 11) for y in range(1, 11)
if abs(x * y - 50) <= 5
}
{(5, 9): 45, (5, 10): 50, (6, 8): 48, (6, 9): 54, (7, 7): 49, (8, 6): 48, (9, 5): 45, (9, 6): 54, (10, 5): 50}