Dictionaries are a powerful data structure used for a number of purposes in Python. In these notes, I show the basics of how dictionaries work and how they can be used to group words in a text. Then I show how the resulting dictionary data structure can be used to perform interesting "cut-ups" through replacement.
Let's start with a simple example: make a list of all words in a text file that begin with the letter a.
words = open("frost.txt").read().split()
a_words = [item for item in words if item.startswith("a")]
a_words
['a', 'as', 'as', 'as', 'as', 'and', 'as', 'about', 'another', 'a', 'ages', 'and', 'ages', 'a', 'and', 'all']
This might be a poetic artifact in itself, or it might be the start of a more elaborate project. You might, for example, use this information as a basis for computational stylistics (e.g., do 20th-century American poets use words beginning with the letter a more frequently than 19th-century British poets?). Or you might be attempting to create a computational model for poetic composition that uses this information, such as a program to generate acrostics.
Actually, let's continue with the goal of making a program to generate acrostics from a given text. Our acrostic procedure will start with a "seed" string (i.e., the word whose letters will serve as the initial letter of each line) and a source text, which will be parsed into words. The poem will consist of a list of lines, one for each letter in the seed string. Each line will have one word, randomly chosen from a list of words beginning with the corresponding letter. So, for example, using The Road Not Taken as the source text, and robertfrost as the seed string, we might end up with output that looks like this:
Really
One
Both
Equally
Roads
Took
For
Really
On
Shall
The
(Most acrostics have more than one word on each line, which is fine. The procedure I'm proposing is a first step toward being able to make a more robust acrostic-generation algorithm.)
How to implement this? Seems simple enough. We already know how to get a list of all words beginning with a particular letter. So hey, let's just copy-and-paste that code for each letter that we want! Something like this:
words = open("frost.txt").read().split()
r_words = [item for item in words if item.startswith("r")]
o_words = [item for item in words if item.startswith("o")]
b_words = [item for item in words if item.startswith("b")]
e_words = [item for item in words if item.startswith("e")]
t_words = [item for item in words if item.startswith("t")]
Hmm. Just starting with the first few letters of robertfrost, it already feels like this is not quite the right way to do this. But let's see it through:
import random
print(random.choice(r_words).capitalize())
print(random.choice(o_words).capitalize())
print(random.choice(b_words).capitalize())
print(random.choice(e_words).capitalize())
print(random.choice(r_words).capitalize())
print(random.choice(t_words).capitalize())
Roads
Other,
Both
Ever
Roads
To
This works, and we could easily complete the program using this technique. There are a few problems with engineering the program this way, however. Let's imagine some scenarios:
Suppose I want to use the seed string difference instead. Or antidisestablishmentarianism. To make the change, I essentially have to start from scratch and copy/paste all new lines of code.
Every letter of the seed string needs its own line of code (e.g., print(random.choice(r_words))). This is fine for short seed texts, but what if you wanted to generate an acrostic with a seed string of hundreds of characters? Thousands? Millions? That's a lot of copying and pasting.
"Okay," you say. "I understand that you think this implementation is not optimal. But you want to be able to write one chunk of code that can be used to extract words that start with any arbitrary character, even when you don't know which characters will be needed when you're writing the program. And then you want some kind of magical... code... thing... that lets you get back all of the words that start with an arbitrary character. Surely what you propose is science fiction; surely this is impossible."
Note: Another way of thinking about this problem is how do you get Python to have variables for all of the data that you want to store, when you're not quite sure what data will be used as input when you write the code?
In fact, it is possible! But to implement such a chunk of code, we need a new data structure. The appropriate data structure in this instance is called the dictionary. Dictionaries are also known as maps, hashes or associative arrays in other programming languages.
Before we get into the why of dictionaries, let's briefly look at the how. A dictionary in Python looks like this:
{'Mercury': 0.387, 'Venus': 0.723, 'Earth': 1.0, 'Mars': 1.523}
{'Earth': 1.0, 'Mars': 1.523, 'Mercury': 0.387, 'Venus': 0.723}
That is: a sequence of key/value pairs, with the key and value of each pair separated by a colon (:) and the pairs themselves separated by commas (,). All of the pairs are themselves surrounded by a pair of curly brackets ({ and }). (In this case, the keys are the names of the planets, and the values are the planets' mean distances from the Sun as measured in astronomical units.)
In other words, we might say that the key Mercury has the value 0.387. The verb map is also sometimes used to refer to this relationship: e.g., the key Mars maps to the value 1.523.
A dictionary is just like any other Python value. You can assign it to a variable:
planet_dist = {'Mercury': 0.387, 'Venus': 0.723, 'Earth': 1.0, 'Mars': 1.523}
And that variable has a type:
type(planet_dist)
dict
At its most basic level, a dictionary is sort of like a two-column spreadsheet, where the key is one column and the value is another column. If you were to represent the dictionary above as a spreadsheet, it might look like this:
key | value |
---|---|
Mercury | 0.387 |
Venus | 0.723 |
Earth | 1.0 |
Mars | 1.523 |
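The main difference between a spreadsheet and a dictionary is that you look things up in a dictionary by key, not by row position, and dictionaries have traditionally been unordered. (For an explanation of this, see below. Since Python 3.7, dictionaries do remember the order in which keys were inserted, but it is still best not to depend on position.) As with a spreadsheet, you can put different types of data into a dictionary.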
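To check how ordering behaves on your own installation, here is a minimal sketch. It assumes Python 3.7 or later, where a dictionary keeps its keys in insertion order; the names demo and key_order are only for illustration.

```python
# A small dictionary written deliberately out of alphabetical order.
demo = {'Venus': 0.723, 'Earth': 1.0, 'Mercury': 0.387}

# Calling list() on a dictionary gives a list of its keys.
# On Python 3.7+ the keys come back in the order they were written.
key_order = list(demo)
print(key_order)
```

Some notebook front ends sort keys alphabetically when they display a dictionary, so what you see on screen may not match the insertion order.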
The primary operation that we'll perform on dictionaries is writing an expression that evaluates to the value for a particular key. We do that with the same square-bracket syntax we used to get a value at a particular index from a list, with a twist: instead of a number, put one of the dictionary's keys inside the brackets. For example, if you want to know how far Venus is from the sun (or, more precisely, the value for the key Venus), write the following expression:
planet_dist["Venus"]
0.723
Going back to our spreadsheet analogy, this is like looking for the row whose first column is "Venus" and getting the value from the corresponding second column.
If we put a key in those brackets that does not exist in the dictionary, we get an error similar to the one we get when trying to access an element beyond the end of a list:
planet_dist["Planet X"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-52-5aed015a470c> in <module>()
----> 1 planet_dist["Planet X"]

KeyError: 'Planet X'
The in operator lets you check to see if a key is in a dictionary before attempting to retrieve its value:
"Pluto" in planet_dist
False
The not in operator is the opposite of in: it returns True if the given key is not in the dictionary.
"Pluto" not in planet_dist
True
As you might suspect, the thing you put inside the brackets doesn't have to be a string; it can be any Python expression, as long as it evaluates to something that is a key in the dictionary:
planet = 'Mercury'
planet_dist[planet]
0.387
After a dictionary has been created, you might want to add new key/value pairs to it. You can do so with an assignment statement, putting the desired value to the right of the = and square bracket notation with the desired key to the left:
planet_dist["Jupiter"] = 5.2
A common pattern when working with dictionaries is to use the current value as the basis for the new value that you want to store. For example, you could wreak havoc with the solar system by changing how far Jupiter is from the sun by running this code:
planet_dist["Jupiter"] = planet_dist["Jupiter"] + 0.5
Or, more compactly:
planet_dist["Jupiter"] += 0.5
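Every time one of the two cells above runs, it adds another 0.5 to Jupiter's value, so the exact number you see depends on how many times you've run them. As with any other kind of value, you can evaluate a dictionary to see its contents in Jupyter Notebook. In the cell below, I do just this so we can see the new key: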
planet_dist
{'Earth': 1.0, 'Jupiter': 6.7, 'Mars': 1.523, 'Mercury': 0.387, 'Venus': 0.723}
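The same "use the current value to compute the new value" pattern is what makes dictionaries handy for counting. The following is a small sketch, not part of the acrostic program: it tallies how often each word appears in a made-up phrase. The names sample_words and counts are only for illustration.

```python
# A short, made-up phrase standing in for a real text.
sample_words = "the road not taken and the road taken".split()

counts = {}
for item in sample_words:
    # Start each new word at zero the first time we see it...
    if item not in counts:
        counts[item] = 0
    # ...then use the current value as the basis for the new one.
    counts[item] += 1

print(counts)
```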
Keys in a dictionary don't have to be strings, and the values don't have to be floating-point numbers. For example, the key might be an integer and the value a string, as in the following example:
number_words = {0: "zero", 1: "one", 2: "two", 3: "three", 4: "four", 5: "five", 10: "ten", 100: "one hundred"}
number_words
{0: 'zero', 1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five', 10: 'ten', 100: 'one hundred'}
print("I have " + number_words[5] + " years left until retirement.")
I have five years left until retirement.
In fact, the values of a dictionary can be of any Python type. (The keys can be of any Python type, except mutable data structures like dictionaries and lists.) Often when talking about dictionaries, you'll hear them described in terms of what types their keys and values are.
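Here is a quick sketch of that restriction on keys. A tuple is immutable, so it can serve as a key, while a list cannot. The names coords and list_key_error are hypothetical and used only for this demonstration.

```python
# Tuples are immutable, so they are allowed as keys.
coords = {(0, 0): "origin", (1, 0): "east"}
print(coords[(1, 0)])

# Lists are mutable, so Python refuses to use one as a key
# and raises a TypeError instead.
list_key_error = None
try:
    bad = {["a", "b"]: "nope"}
except TypeError as err:
    list_key_error = str(err)
    print("TypeError:", list_key_error)
```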
Going back to our acrostic example, the kind of dictionary that interests us here is a dictionary whose keys are strings and whose values are lists of strings. If you were to construct one of these by hand, you might write something like this:
planet_moons = {
'Mercury': [],
'Venus': [],
'Earth': ['Moon'],
'Mars': ['Phobos', 'Deimos'],
'Jupiter': ["Io", "Europa", "Ganymede", "Callisto"]
}
This is a dictionary whose keys are the names of the five innermost planets and whose values are lists of the names of the moons of those planets. (I've included only the four largest moons of Jupiter for brevity.) Retrieving the value for a key in this data structure will yield a value of type list. For example, to count the number of moons orbiting Mars:
len(planet_moons['Mars'])
2
To get the name of the second moon (index 1) listed as orbiting Mars:
planet_moons["Mars"][1]
'Deimos'
And to get a random moon of Jupiter:
import random
random.choice(planet_moons["Jupiter"])
'Callisto'
Let's say that I change careers and become an astronomer. After several years of diligent search, using tiny variations in the rotation of Venus and pressure variations in the atmosphere that are most easily explained by tidal mechanics, I discover a new moon of Venus and name it after myself. My crowning moment of glory would be to add this newly discovered moon to the data structure that I made years before in my previous life as a lowly computer poet. I would do this like so:
planet_moons["Venus"].append("Allison")
planet_moons
{'Earth': ['Moon'], 'Jupiter': ['Io', 'Europa', 'Ganymede', 'Callisto'], 'Mars': ['Phobos', 'Deimos'], 'Mercury': [], 'Venus': ['Allison']}
Glorious! It may not feel like it, but I've now shown you everything you need to know in order to create the computational acrostic code described above. Let's make it happen.
With the dictionary as a starting point, think about how we need the data organized in order to produce the acrostic. We need to be able to store every word that begins with a particular letter, and we need to be able to get back a list of words that begin with any letter. The data structure we want to end up with might look something like this:
{'a': ['as', 'above', 'about'],
'b': ['both', 'bent'],
'c': ['could', 'could', 'claim', 'come'],
...
}
(Using an ellipsis there to indicate that there would be more key/value pairs.)
The task at hand consists of two parts: analyzing the text and generating the text. The goal of the analysis step is to create a dictionary whose keys are initial letters and whose values are all of the words in the text that start with that letter.
To make this data structure, we're going to build the dictionary gradually, word by word, by looping over a list of words in the text. Here's what the code looks like. I've left comments in-line.
words = open("frost.txt").read().split() # read in a text file, split into words
initials = {} # create an empty dictionary
for item in words: # run this code for every word in the text
    first_let = item[0] # first_let now has the first character of the string
    # check to see if the letter is already a key in the dictionary.
    # if not, add a new key/value pair with an empty list as the value.
    if first_let not in initials:
        initials[first_let] = []
    # append the current word to the list that is the value for this key
    initials[first_let].append(item)
    # uncomment line below to see debug output
    #print(item, first_let, initials[first_let])
Here's what the data structure looks like when everything's done:
initials
{'A': ['And', 'And', 'And', 'And', 'And', 'And'], 'B': ['Because'], 'H': ['Had'], 'I': ['I', 'I', 'I', 'In', 'I', 'I', 'I', 'I', 'I—', 'I'], 'O': ['Oh,'], 'S': ['Somewhere'], 'T': ['Two', 'To', 'Then', 'Though', 'Two'], 'Y': ['Yet'], 'a': ['a', 'as', 'as', 'as', 'as', 'and', 'as', 'about', 'another', 'a', 'ages', 'and', 'ages', 'a', 'and', 'all'], 'b': ['both', 'be', 'bent', 'better', 'both', 'black.', 'back.', 'be', 'by,'], 'c': ['could', 'could', 'claim,', 'come'], 'd': ['diverged', 'down', 'day!', 'doubted', 'diverged', 'difference.'], 'e': ['equally', 'ever'], 'f': ['far', 'fair,', 'for', 'first', 'for'], 'g': ['grassy'], 'h': ['having', 'had', 'how', 'hence:', 'has'], 'i': ['in', 'it', 'in', 'it', 'if', 'in'], 'j': ['just'], 'k': ['kept', 'knowing'], 'l': ['long', 'looked', 'lay', 'leaves', 'leads', 'less'], 'm': ['morning', 'made'], 'n': ['not', 'no'], 'o': ['one', 'one', 'other,', 'on', 'one'], 'p': ['perhaps', 'passing'], 'r': ['roads', 'really', 'roads'], 's': ['sorry', 'stood', 'same,', 'step', 'should', 'shall', 'sigh'], 't': ['travel', 'traveler,', 'the', 'took', 'the', 'the', 'that', 'the', 'there', 'them', 'the', 'that', 'trodden', 'the', 'to', 'telling', 'this', 'took', 'the', 'travelled', 'that', 'the'], 'u': ['undergrowth;'], 'w': ['wood,', 'where', 'was', 'wanted', 'wear;', 'worn', 'way', 'way,', 'with', 'wood,'], 'y': ['yellow']}
Challenge: Modify the code above so that it's case-insensitive (i.e., words starting with I are stored in the same list as words starting with i).
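One possible approach to the challenge, offered as a sketch rather than the only answer: lowercase the first letter before using it as the key, and leave the word itself untouched. To keep the example self-contained, it uses the first two lines of the poem as a string (sample_text) in place of frost.txt, and a separate dictionary named initials_ci so it doesn't overwrite initials.

```python
# Stand-in for open("frost.txt").read(): the poem's first two lines.
sample_text = "Two roads diverged in a yellow wood, And sorry I could not travel both"
words = sample_text.split()

initials_ci = {}
for item in words:
    # Normalize only the key; the stored word keeps its original case.
    first_let = item[0].lower()
    if first_let not in initials_ci:
        initials_ci[first_let] = []
    initials_ci[first_let].append(item)

print(initials_ci["a"])
print(initials_ci["t"])
```

With this change, the seed string should be lowercased as well before looking up its letters.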
We have the data structure that maps letters to lists of words that start with those letters.
initials["a"]
['a', 'as', 'as', 'as', 'as', 'and', 'as', 'about', 'another', 'a', 'ages', 'and', 'ages', 'a', 'and', 'all']
initials["b"]
['both', 'be', 'bent', 'better', 'both', 'black.', 'back.', 'be', 'by,']
initials["c"]
['could', 'could', 'claim,', 'come']
We can pick a random item from one of these lists using random.choice():
random.choice(initials["d"])
'doubted'
Writing the acrostic text is now just a matter of a list comprehension:
acrostic = [random.choice(initials[let]).capitalize() for let in "robertfrost"]
print("\n".join(acrostic))
Roads
One
Better
Equally
Roads
That
For
Roads
One
Step
Travel
Or, as a for loop:
for let in "robertfrost":
    word = random.choice(initials[let])
    print(word.capitalize())
Really
One
Be
Equally
Roads
The
For
Roads
One
Sigh
Took
There are, of course, a number of letters in the alphabet not represented in our data. For example, there are no words starting with z. So if we tried to make an acrostic with the word pizza:
for let in "pizza":
    word = random.choice(initials[let])
    print(word.capitalize())
Passing
It
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-121-11f5aec41b7b> in <module>()
      1 for let in "pizza":
----> 2     word = random.choice(initials[let])
      3     print(word.capitalize())

KeyError: 'z'
We get a KeyError because the dictionary does not contain the key z. Whoops! There's no real way to fix this problem without expanding the data that we're using, but we can at least write the code so that it's robust against these kinds of errors. There are two ways of doing this. The first is to check if the key is present, using the in operator:
for let in "pizza":
    if let in initials:
        word = random.choice(initials[let])
        print(word.capitalize())
    else:
        print(">>> WARNING! ACROSTIC FAILURE! WARNING")
Passing
It
>>> WARNING! ACROSTIC FAILURE! WARNING
>>> WARNING! ACROSTIC FAILURE! WARNING
Another
The second is to use the dictionary's .get() method, which attempts to retrieve the value for the key given as the first parameter, and if that key is not found, returns the value given as the second parameter:
for let in "pizza":
    word = random.choice(initials.get(let, ["???"]))
    print(word.capitalize())
Passing
In
???
???
A
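The .get() approach can be wrapped in a function so the seed string becomes a parameter instead of something typed into the loop. This is a sketch under a few assumptions: make_acrostic, its fallback parameter, and the tiny toy_initials dictionary are hypothetical names introduced here, standing in for the full initials dictionary built from the poem.

```python
import random

def make_acrostic(seed, initials, fallback="???"):
    """Return a list with one capitalized word per letter of seed."""
    lines = []
    for let in seed:
        # .get() returns the one-item fallback list when the letter is missing.
        candidates = initials.get(let, [fallback])
        lines.append(random.choice(candidates).capitalize())
    return lines

# A toy dictionary with no words for the letter z.
toy_initials = {
    "p": ["perhaps", "passing"],
    "i": ["in", "it", "if"],
    "a": ["a", "as", "all"],
}

poem = make_acrostic("pizza", toy_initials)
print("\n".join(poem))
```

Returning a list rather than printing inside the function leaves the caller free to join, print, or further process the lines.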
Earlier, we did a little dance in the code to add an empty list for a key only if the key wasn't already present. That dance is so common that the makers of Python have invented a way around it: the defaultdict. A defaultdict is like a regular dictionary, except that if you attempt to look up a key that doesn't exist, it will automatically create that key with a default value. To use the defaultdict data structure, you first need to include and run the following line in your code:
from collections import defaultdict
Now you can create a defaultdict value by calling the defaultdict() function, with the default data type inside the parentheses. For example, to create a dictionary whose values default to lists:
initials = defaultdict(list)
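To see exactly when a defaultdict creates a key, here is a small sketch using a separate, throwaway dictionary (demo_dd is a hypothetical name, chosen so that initials is left alone).

```python
from collections import defaultdict

demo_dd = defaultdict(list)

# "a" does not exist yet; looking it up creates an empty list,
# and .append() then adds to that list.
demo_dd["a"].append("apple")
print(demo_dd["a"])

# Merely looking up a missing key with square brackets also creates it.
print("z" in demo_dd)
demo_dd["z"]
print("z" in demo_dd)

# The in operator and .get() do not create keys.
print(demo_dd.get("q"))
print("q" in demo_dd)
```

One consequence worth knowing: with a defaultdict(list), an expression like random.choice(initials["z"]) no longer raises KeyError for a missing letter. It creates an empty list and random.choice() then raises IndexError, so checking with in or using .get() remains the safer route.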
Now the code for writing the acrostic text analyzer is a little bit simpler. You don't need to check for the presence of the key/value pair; you can just append to it as though it already exists:
words = open("frost.txt").read().split()
for item in words:
    first_let = item[0]
    initials[first_let].append(item)
initials
defaultdict(list, {'A': ['And', 'And', 'And', 'And', 'And', 'And'], 'B': ['Because'], 'H': ['Had'], 'I': ['I', 'I', 'I', 'In', 'I', 'I', 'I', 'I', 'I—', 'I'], 'O': ['Oh,'], 'S': ['Somewhere'], 'T': ['Two', 'To', 'Then', 'Though', 'Two'], 'Y': ['Yet'], 'a': ['a', 'as', 'as', 'as', 'as', 'and', 'as', 'about', 'another', 'a', 'ages', 'and', 'ages', 'a', 'and', 'all'], 'b': ['both', 'be', 'bent', 'better', 'both', 'black.', 'back.', 'be', 'by,'], 'c': ['could', 'could', 'claim,', 'come'], 'd': ['diverged', 'down', 'day!', 'doubted', 'diverged', 'difference.'], 'e': ['equally', 'ever'], 'f': ['far', 'fair,', 'for', 'first', 'for'], 'g': ['grassy'], 'h': ['having', 'had', 'how', 'hence:', 'has'], 'i': ['in', 'it', 'in', 'it', 'if', 'in'], 'j': ['just'], 'k': ['kept', 'knowing'], 'l': ['long', 'looked', 'lay', 'leaves', 'leads', 'less'], 'm': ['morning', 'made'], 'n': ['not', 'no'], 'o': ['one', 'one', 'other,', 'on', 'one'], 'p': ['perhaps', 'passing'], 'r': ['roads', 'really', 'roads'], 's': ['sorry', 'stood', 'same,', 'step', 'should', 'shall', 'sigh'], 't': ['travel', 'traveler,', 'the', 'took', 'the', 'the', 'that', 'the', 'there', 'them', 'the', 'that', 'trodden', 'the', 'to', 'telling', 'this', 'took', 'the', 'travelled', 'that', 'the'], 'u': ['undergrowth;'], 'w': ['wood,', 'where', 'was', 'wanted', 'wear;', 'worn', 'way', 'way,', 'with', 'wood,'], 'y': ['yellow']})
Exercise: Use defaultdict to make a dictionary that has integer keys for word length, whose values are lists of words with that length. The dictionary that you end up with should look something like: {1: ['a', 'I', ...], 2: ['in', 'be', 'as'...], 3: ['Two', 'And', 'not']...}
This particular data structure is good for more than just acrostics. Here's another possible use: take a text and replace each of its words with a different word that begins with the same letter. You can do this with the entire poem by reading it in like this:
words = open("frost.txt").read().split()
print(' '.join([random.choice(initials[item[0]]) for item in words]))
Though roads difference. in as yellow with And shall I come no there by, And be one the looked I sigh And looked doubted other, as fair, and I claim, Two where in be if there undergrowth; Two trodden that one and just and far And how perhaps that by, come Because it wanted grassy all was where Two a far the the perhaps telling Had wear; the really ages travel should And be the morning equally long I— leads not shall how the bent Oh, In kept the fair, for ages day! Yet kept having way lay one there wanted I down if I step equally could back. I— sorry black. this that wood, as stood Somewhere as ages another hence: Then roads diverged in a where another I I took took one leaves the better And the how made as travelled day!
That expression, [random.choice(initials[item[0]]) for item in words], is pretty complex. Let's break it down.
[<expression> for item in words]: a list comprehension that evaluates <expression> for every element in the list words, with the temporary variable item
item[0]: the first letter of the string in item
initials[item[0]]: the list of words beginning with that letter
random.choice(initials[item[0]]): a randomly chosen element from that list
[random.choice(initials[item[0]]) for item in words]: a list of randomly chosen words, each beginning with the same letter as the corresponding word in the list words
If you want to preserve the line breaks in the original poem, the easiest way is to write a for loop that reads each line of the text file as a string, like so:
for line in open("frost.txt"):
    words = line.split()
    print(' '.join([random.choice(initials[item[0]]) for item in words]))
Two roads difference. if all yellow wood, And sigh I come not took both And both other, trodden long I stood And less day! one as for and I come Though was in back. if the undergrowth; To to took one and just a far And having perhaps the be come Because in wood, grassy as worn wanted To all fair, took travelled passing took Had with took really a this should And by, the made ever looked I long no step had traveler, by, Oh, I kept that fair, far as difference. Yet kept how worn looked on the wood, I— difference. if I should ever claim, be I stood bent traveler, to with as same, Somewhere and as as having Then really diverged in a wanted and I I there the one leaves that both And traveler, hence: morning as the doubted
Of course, there's no reason that we have to do lexical replacement on the same text that we got the original words from! In the cell below, I make a separate model of words from Sea Rose:
sea_rose_init = defaultdict(list)
words = open("sea_rose.txt").read().split()
for item in words:
    first_let = item[0]
    sea_rose_init[first_let].append(item)
sea_rose_init
defaultdict(list, {'-': ['--'], 'C': ['Can'], 'R': ['Rose,'], 'S': ['Stunted,'], 'a': ['and', 'a', 'a', 'are', 'are', 'are', 'acrid', 'a'], 'c': ['caught', 'crisp'], 'd': ['drift.', 'drives', 'drip'], 'f': ['flower,', 'flung', 'fragrance'], 'h': ['harsh', 'hardened'], 'i': ['in', 'in', 'in', 'in'], 'l': ['leaf,', 'leaf,', 'lifted', 'leaf?'], 'm': ['marred', 'meagre', 'more'], 'o': ['of', 'of', 'on', 'on'], 'p': ['petals,', 'precious'], 'r': ['rose,', 'rose'], 's': ['stint', 'spare', 'single', 'stem', 'small', 'sand,', 'sand', 'spice-rose', 'such'], 't': ['thin,', 'than', 'the', 'the', 'the', 'that', 'the', 'the'], 'w': ['with', 'wet', 'with', 'wind.'], 'y': ['you', 'you', 'you']})
The goal of the code below is to rewrite each line of The Road Not Taken, replacing each word with a word that begins with the same letter from Sea Rose. The problem is that The Road Not Taken has words that begin with letters that aren't found as word-initial letters in Sea Rose! So we need a backup strategy. The strategy I chose below was to check to see if the first letter of each word was present in the dictionary. If it is, then get a random word starting with that letter. If it isn't, then just use the word from the source text.
# for each line in the text file...
for line in open("frost.txt"):
    words = line.split() # split the line into words
    output = [] # the output for each line starts with an empty list
    for item in words: # for each word in the line...
        first_let = item[0]
        # check if the first letter is in the dictionary
        if first_let in sea_rose_init: # if we have alternatives for that letter...
            # add a randomly-chosen word that starts with the same letter to the output
            output.append(random.choice(sea_rose_init[first_let]))
        else:
            # otherwise, just use the word from the source text
            output.append(item)
        # uncomment line below to see how the list gets built, item by item
        #print(line, item, output)
    print(' '.join(output))
Two rose, drives in are you with And such I caught not the both And be of the leaf, I such And leaf, drives of are flung a I crisp To with in bent in the undergrowth; Then the the on and just and flung And harsh petals, that better crisp Because in wet grassy and with with Though are flung than than petals, that Had wet the rose, a than stem And both the meagre equally leaf? In leaf, no stint hardened thin, black. Oh, I kept thin, flower, flung a drip Yet knowing harsh wet leaf, of the wet I drives in I single ever caught back. I sand, be the the with and sand Stunted, a and are harsh Two rose, drift. in are with a I— I the than of leaf? the by, And the harsh more acrid the drip
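If you plan to mix and match several texts, the analysis step and the replacement step can each be packaged as a function. The sketch below is one way to do that. build_initials and replace_line are hypothetical names, and short strings stand in for the two text files so that the example runs without them.

```python
import random
from collections import defaultdict

def build_initials(text):
    """Map each initial character to the list of words starting with it."""
    model = defaultdict(list)
    for word in text.split():
        model[word[0]].append(word)
    return model

def replace_line(line, model):
    """Swap each word for a random model word with the same first letter."""
    output = []
    for word in line.split():
        # Using in for the check avoids creating empty keys in the defaultdict.
        if word[0] in model:
            output.append(random.choice(model[word[0]]))
        else:
            # No alternative available: keep the original word.
            output.append(word)
    return " ".join(output)

# A few words standing in for sea_rose.txt, and one line standing in for frost.txt.
model = build_initials("Rose, harsh rose, marred and with stint of petals,")
source_line = "Two roads diverged in a yellow wood,"
new_line = replace_line(source_line, model)
print(new_line)
```

Because this tiny model has exactly one word for each letter it knows, the result here is the same on every run; with a full text as the model, each run will differ.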