Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud.
The collections module in the standard library provides specialized mapping types for common use cases.
defaultdict Type
The defaultdict type allows us to define a factory function that creates default values whenever we look up a key that does not yet exist. Ordinary dict objects raise a KeyError exception in such situations.
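As a minimal sketch of that difference (the names plain and with_default and the key "missing" are made up for illustration), the same look-up fails in one case and creates an entry in the other:

```python
from collections import defaultdict

plain = {}
with_default = defaultdict(list)  # list() is called for every missing key

# An ordinary dict object raises a KeyError for an unknown key.
try:
    plain["missing"]
except KeyError:
    plain_raised = True
else:
    plain_raised = False

# A defaultdict object calls the factory and stores the result instead.
value = with_default["missing"]

print(plain_raised)               # True
print(value)                      # []
print("missing" in with_default)  # True: the look-up inserted the key
```

Note that the mere look-up already adds the key to the defaultdict object.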
Let's say we have a list with records of goals scored during a soccer game. Each record consists of the fields "Country," "Player," and the "Time" when a goal was scored. Our task is to group the goals by player and/or country.
goals = [
    ("Germany", "Müller", 11), ("Germany", "Klose", 23),
    ("Germany", "Kroos", 24), ("Germany", "Kroos", 26),
    ("Germany", "Khedira", 29), ("Germany", "Schürrle", 69),
    ("Germany", "Schürrle", 79), ("Brazil", "Oscar", 90),
]
Using a normal dict object, we have to tediously check if a player has already scored a goal before. If not, we must create a new list object with the first time the player scored. Otherwise, we append the goal to an already existing list object.
goals_by_player = {}
for _, player, minute in goals:
    if player not in goals_by_player:
        goals_by_player[player] = [minute]
    else:
        goals_by_player[player].append(minute)
goals_by_player
{'Müller': [11], 'Klose': [23], 'Kroos': [24, 26], 'Khedira': [29], 'Schürrle': [69, 79], 'Oscar': [90]}
Instead, with a defaultdict object, we can express the code fragment's intent more concisely. We pass a reference to the list() built-in to defaultdict.
from collections import defaultdict
goals_by_player = defaultdict(list)
for _, player, minute in goals:
    goals_by_player[player].append(minute)
goals_by_player
defaultdict(list, {'Müller': [11], 'Klose': [23], 'Kroos': [24, 26], 'Khedira': [29], 'Schürrle': [69, 79], 'Oscar': [90]})
type(goals_by_player)
collections.defaultdict
A reference to the factory function is stored in the default_factory attribute.
goals_by_player.default_factory
list
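The factory does not have to be list(). As a small sketch, assuming we only want the number of goals per country (the name goals_per_country and the shortened sample records are made up for illustration), int() works as a factory because it returns 0 when called without arguments:

```python
from collections import defaultdict

# Shortened sample data in the same (country, player, minute) layout.
sample_goals = [
    ("Germany", "Müller", 11),
    ("Germany", "Klose", 23),
    ("Brazil", "Oscar", 90),
]

goals_per_country = defaultdict(int)  # int() returns 0 for a missing key

for country, _, _ in sample_goals:
    goals_per_country[country] += 1

print(goals_per_country["Germany"])       # 2
print(goals_per_country["Brazil"])        # 1
print(goals_per_country.default_factory)  # <class 'int'>
```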
If we want this code to produce a normal dict object, we pass goals_by_player to the dict() constructor.
dict(goals_by_player)
{'Müller': [11], 'Klose': [23], 'Kroos': [24, 26], 'Khedira': [29], 'Schürrle': [69, 79], 'Oscar': [90]}
Being creative, we use a factory function, created with a lambda expression, that returns another defaultdict with list() as its factory to group on the country and the player level simultaneously.
goals_by_country_and_player = defaultdict(lambda: defaultdict(list))
for country, player, minute in goals:
    goals_by_country_and_player[country][player].append(minute)
goals_by_country_and_player
defaultdict(<function __main__.<lambda>()>, {'Germany': defaultdict(list, {'Müller': [11], 'Klose': [23], 'Kroos': [24, 26], 'Khedira': [29], 'Schürrle': [69, 79]}), 'Brazil': defaultdict(list, {'Oscar': [90]})})
Conversion into a normal and nested dict object is now a bit tricky but can be achieved in one line with a comprehension.
{country: dict(by_player) for country, by_player in goals_by_country_and_player.items()}
{'Germany': {'Müller': [11], 'Klose': [23], 'Kroos': [24, 26], 'Khedira': [29], 'Schürrle': [69, 79]}, 'Brazil': {'Oscar': [90]}}
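For deeper or unknown nesting levels, the same idea can be written as a recursive helper. The function name to_plain_dict is hypothetical and not part of the collections module; this is one possible sketch under that assumption:

```python
from collections import defaultdict


def to_plain_dict(mapping):
    """Recursively convert nested defaultdict objects into dict objects."""
    return {
        key: to_plain_dict(value) if isinstance(value, defaultdict) else value
        for key, value in mapping.items()
    }


# Small two-level example in the style of the grouping by country and player.
nested = defaultdict(lambda: defaultdict(list))
nested["Germany"]["Kroos"].append(24)
nested["Germany"]["Kroos"].append(26)
nested["Brazil"]["Oscar"].append(90)

plain = to_plain_dict(nested)

print(plain)  # {'Germany': {'Kroos': [24, 26]}, 'Brazil': {'Oscar': [90]}}
print(type(plain["Germany"]))  # <class 'dict'>
```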
Counter Type
A common task is to count the number of occurrences of elements in an iterable.
The Counter type provides an easy-to-use interface that can be called with any iterable and returns a dict-like object of type Counter that maps each unique element to the number of times it occurs.
To continue the previous example, let's create an overview that shows how many goals each player scored. We use a generator expression as the argument to Counter.
goals
[('Germany', 'Müller', 11), ('Germany', 'Klose', 23), ('Germany', 'Kroos', 24), ('Germany', 'Kroos', 26), ('Germany', 'Khedira', 29), ('Germany', 'Schürrle', 69), ('Germany', 'Schürrle', 79), ('Brazil', 'Oscar', 90)]
from collections import Counter
scorers = Counter(x[1] for x in goals)
scorers
Counter({'Kroos': 2, 'Schürrle': 2, 'Müller': 1, 'Klose': 1, 'Khedira': 1, 'Oscar': 1})
type(scorers)
collections.Counter
Now we can look up individual players. scorers behaves like a normal dictionary with regard to key look-ups.
scorers["Müller"]
1
By default, it returns 0 if a key is not found. So, we do not have to handle a KeyError.
scorers["Lahm"]
0
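One detail worth checking is that such a look-up does not store the missing key. The sketch below compares this with a defaultdict object that uses int() as its factory; the variable names counter and default are made up for illustration:

```python
from collections import Counter, defaultdict

counter = Counter(["Kroos", "Kroos", "Klose"])
default = defaultdict(int, {"Kroos": 2, "Klose": 1})

missing_in_counter = counter["Lahm"]  # 0, and "Lahm" is NOT stored
missing_in_default = default["Lahm"]  # 0, but "Lahm" IS stored

print(missing_in_counter, missing_in_default)  # 0 0
print("Lahm" in counter)  # False
print("Lahm" in default)  # True
```

So a Counter object stays unchanged no matter how many unknown keys we ask for.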
Counter objects have a .most_common() method that returns a list object containing 2-element tuple objects, where the first element is the element from the original iterable and the second is the number of occurrences. The list object is sorted in descending order of occurrences.
scorers.most_common(2)
[('Kroos', 2), ('Schürrle', 2)]
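Called without an argument, .most_common() returns all elements. The sketch below rebuilds a Counter object from a plain list of player names (written out by hand here so the example is self-contained) and assumes Python 3.7 or later, where elements with equal counts keep the order in which they were first encountered:

```python
from collections import Counter

players = [
    "Müller", "Klose", "Kroos", "Kroos",
    "Khedira", "Schürrle", "Schürrle", "Oscar",
]
scorers = Counter(players)

ranking = scorers.most_common()  # no argument: all elements, highest count first

print(len(ranking))  # 6
print(ranking[0])    # ('Kroos', 2)
print(ranking[-1])   # ('Oscar', 1)
```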
We can increase the count of individual entries with the .update() method, which takes an iterable of the elements we want to count.
Imagine if Philipp Lahm had also scored against Brazil.
scorers.update(["Lahm"])
scorers
Counter({'Kroos': 2, 'Schürrle': 2, 'Müller': 1, 'Klose': 1, 'Khedira': 1, 'Oscar': 1, 'Lahm': 1})
If we use a str object as the argument instead, each individual character is treated as an element to be updated. That is most likely not what we want.
scorers.update("Lahm")
scorers
Counter({'Kroos': 2, 'Schürrle': 2, 'Müller': 1, 'Klose': 1, 'Khedira': 1, 'Oscar': 1, 'Lahm': 1, 'L': 1, 'a': 1, 'h': 1, 'm': 1})
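Besides an iterable of elements, .update() also accepts a mapping or keyword arguments holding counts, and it adds these counts to the existing ones rather than replacing them. A minimal sketch with made-up numbers:

```python
from collections import Counter

scorers = Counter({"Kroos": 2, "Klose": 1})

# A mapping of element -> count: the counts are added, not overwritten.
scorers.update({"Kroos": 1, "Lahm": 2})

# Keyword arguments work the same way.
scorers.update(Lahm=1)

print(scorers["Kroos"])  # 3
print(scorers["Lahm"])   # 3
print(scorers["Klose"])  # 1
```

This differs from dict.update(), which would overwrite the value stored under an existing key.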
ChainMap Type
Consider to_words, more_words, and even_more_words below. Instead of merging the items of the three dict objects together into a new one, we want to create an object that behaves as if it contained all the unified items in it without materializing them in memory a second time.
to_words = {
    0: "zero",
    1: "one",
    2: "two",
}
more_words = {
    2: "TWO",  # to illustrate a point
    3: "three",
    4: "four",
}
even_more_words = {
    4: "FOUR",  # to illustrate a point
    5: "five",
    6: "six",
}
The ChainMap type allows us to do precisely that.
from collections import ChainMap
We simply pass all mappings as positional arguments to ChainMap and obtain a proxy object that occupies almost no memory but gives us access to the union of all the items.
chain = ChainMap(to_words, more_words, even_more_words)
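Because a ChainMap object only holds references to the underlying mappings, later changes to those mappings show up in the chain right away. A small sketch with two made-up dict objects, first and second:

```python
from collections import ChainMap

first = {0: "zero"}
second = {1: "one"}
view = ChainMap(first, second)

# Mutate an underlying dict object after the chain was created.
second[2] = "two"

print(view[2])                # two
print(view.maps[0] is first)  # True: the original dict is referenced, not copied
print(len(view))              # 3
```

The .maps attribute exposes the list of underlying mappings in look-up order.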
Let's loop over the items in chain and see what is "in" it. All seven items we expected are there. The order may look surprising: in recent Python versions, iteration goes through the underlying mappings from last to first. Keys of later mappings do not overwrite earlier keys.
for number, word in chain.items():
    print(number, word)
4 four
5 five
6 six
2 two
3 three
0 zero
1 one
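To make the precedence rule concrete, the following sketch repeats the three dict objects so that it runs on its own and looks up the two duplicated keys. The iteration order shown in the last line is what recent CPython versions produce (the source output above comes from Python 3.12); treat it as an observation rather than something to rely on:

```python
from collections import ChainMap

to_words = {0: "zero", 1: "one", 2: "two"}
more_words = {2: "TWO", 3: "three", 4: "four"}
even_more_words = {4: "FOUR", 5: "five", 6: "six"}

chain = ChainMap(to_words, more_words, even_more_words)

# Look-ups search the mappings from first to last and stop at the first hit.
print(chain[2])  # two  (found in to_words, so "TWO" is never reached)
print(chain[4])  # four (found in more_words, so "FOUR" is never reached)

# Duplicate keys are counted only once.
print(len(chain))  # 7

# Iteration goes through the mappings from last to first.
print(list(chain))  # [4, 5, 6, 2, 3, 0, 1]
```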
When looking up a non-existent key, ChainMap objects raise a KeyError just like normal dict objects would.
chain[10]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[28], line 1
----> 1 chain[10]

File /usr/lib64/python3.12/collections/__init__.py:1014, in ChainMap.__getitem__(self, key)
   1012     except KeyError:
   1013         pass
-> 1014 return self.__missing__(key)

File /usr/lib64/python3.12/collections/__init__.py:1006, in ChainMap.__missing__(self, key)
   1005 def __missing__(self, key):
-> 1006     raise KeyError(key)

KeyError: 10
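If a missing key is an expected situation, the usual mapping tools for avoiding the exception are available on ChainMap objects as well. A minimal sketch with a small made-up chain and the placeholder value "n/a":

```python
from collections import ChainMap

small_chain = ChainMap({0: "zero"}, {1: "one"})

# The in operator checks all underlying mappings.
print(10 in small_chain)  # False
print(1 in small_chain)   # True

# .get() returns a default value instead of raising a KeyError.
word = small_chain.get(10, "n/a")
print(word)  # n/a
```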