Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud.
In this chapter, we continue the study of the built-in data types. The next layer on top of numbers consists of textual data that are modeled primarily with the str
type in Python. str
objects are more complex than the numeric objects in Chapter 5 as they consist of an arbitrary and possibly large number of individual characters that may be chosen from any alphabet in the history of humankind. Luckily, Python abstracts away most of this complexity from us. However, after looking at the
str
type in great detail, we briefly introduce the bytes
type at the end of this chapter to understand how characters are modeled in memory.
str Type

To create a str
object, we use the literal notation and type the text between enclosing double quotes "
.
text = "Lorem ipsum dolor sit amet."
Like everything in Python, text
is an object with an identity, a type, and a value.
id(text)
140667764715968
type(text)
str
As seen before, a str
object evaluates to itself in a literal notation with enclosing single quotes '
.
In Chapter 1 , we specify the double quotes
"
convention this book follows. Yet, single quotes '
and double quotes "
are perfect substitutes. We could use the reverse convention, as well. As this discussion shows, many programmers have strong opinions about such conventions. Consequently, the discussion was "closed as not constructive" by the moderators.
text
'Lorem ipsum dolor sit amet.'
As the single quote '
is often used in the English language as a shortener, we could make an argument in favor of using the double quotes "
: There are possibly fewer situations like the two code cells below, where we must escape the kind of quote used as the str
object's delimiter with a backslash "\"
inside the text (cf., also the "Unicode & (Special) Characters" section further below). However, double quotes "
are often used as well, for example, to indicate a quote like the one by Albert Einstein below. So, such arguments are not convincing.
Many proponents of the single quote '
usage claim that double quotes "
cause more visual noise on the screen. However, this argument is also not convincing as, for example, one could claim that two single quotes ''
look so similar to one double quote "
that a reader may confuse an empty str
object with a missing closing quote "
. With the double quotes "
convention we at least avoid such confusion (i.e., empty str
objects are written as ""
).
This discussion is an excellent example of a flame war in the programming world: Everyone has an opinion and the discussion leads to no result.
"Einstein said, \"If you can't explain it, you don't understand it.\""
'Einstein said, "If you can\'t explain it, you don\'t understand it."'
'Einstein said, "If you can\'t explain it, you don\'t understand it."'
'Einstein said, "If you can\'t explain it, you don\'t understand it."'
An important fact to know is that enclosing quotes of either kind are not part of the str
object's value! They are merely syntax indicating the literal notation.
So, printing out the sentence with the built-in print() function does the same in both cases.
print("Einstein said, \"If you can't explain it, you don't understand it.\"")
Einstein said, "If you can't explain it, you don't understand it."
print('Einstein said, "If you can\'t explain it, you don\'t understand it."')
Einstein said, "If you can't explain it, you don't understand it."
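A quick check, sketched here with variable names of our own choosing, confirms that neither the enclosing quotes nor the escaping backslashes are part of the value: both spellings create equal str objects.

```python
# Two spellings of the same sentence; the variable names are illustrative only.
double_quoted = "Einstein said, \"If you can't explain it, you don't understand it.\""
single_quoted = 'Einstein said, "If you can\'t explain it, you don\'t understand it."'

# Both literals evaluate to the same value ...
print(double_quoted == single_quoted)  # True

# ... and the backslashes used for escaping are not stored as characters.
print("\\" in double_quoted)  # False
print(len(double_quoted) == len(single_quoted))  # True
```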
As an alternative to the literal notation, we may use the built-in str() constructor to cast non-
str
objects as str
ones. As Chapter 11 reveals, basically any object in Python has a text representation. Because of that we may also pass
list
objects, the boolean True
and False
, or None
to str() .
str(42)
'42'
str(42.87)
'42.87'
str([1, 2, 3])
'[1, 2, 3]'
str(True)
'True'
str(False)
'False'
str(None)
'None'
user_input = input("Whatever you enter is put in a new string: ")
type(user_input)
str
user_input
'123'
file = open("lorem_ipsum.txt")
open() returns a proxy
object of type
TextIOWrapper
that allows us to interact with the file on disk. In the object's text representation further below, mode='r'
shows that we opened the file in read-only mode, and encoding='UTF-8'
is explained in detail in the section "The bytes
Type" at the end of this chapter.
type(file)
_io.TextIOWrapper
file
<_io.TextIOWrapper name='lorem_ipsum.txt' mode='r' encoding='UTF-8'>
TextIOWrapper
objects come with plenty of type-specific methods and attributes.
file.readable()
True
file.writable()
False
file.name
'lorem_ipsum.txt'
file.encoding
'UTF-8'
So far, we have not yet read anything from the file (i.e., from disk)! That is intentional as, for example, the file could contain more data than could fit into our computer's memory. Therefore, we have to explicitly instruct the file
object to read some of or all the data in the file.
One way to do that is to simply loop over the file
object with the for
statement as shown next: In each iteration, line
is assigned the next line in the file. Because we may loop over TextIOWrapper
objects, they are iterables.
for line in file:
    print(line)
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets.
Once we have looped over the file
object, it is exhausted: We cannot loop over it a second time. So, the built-in print() function is never called in the code cell below!
for line in file:
    print(line)
After the for
-loop, the line
variable is still set and references the last line in the file. We verify that it is indeed a str
object.
line
'the 1960s with the release of Letraset sheets.\n'
type(line)
str
An important observation is that the file
object is still associated with an open file descriptor . Without going into any technical details, we note that an operating system can only handle a limited number of "open files" at the same time, and, therefore, we should always close the file once we are done processing it.
TextIOWrapper
objects have a closed
attribute on them that indicates if the associated file descriptor is still open or has been closed. We can "manually" close any TextIOWrapper
object with the close() method.
file.closed
False
file.close()
file.closed
True
The more Pythonic way is to use open() within the compound
with
statement (cf., reference ): In the example below, the indented code block is said to be executed within the context of the
file
object that now plays the role of a context manager . Many different kinds of context managers exist in Python with different applications and purposes. Context managers returned from open()
mainly ensure that file descriptors get automatically closed after the last line in the code block is executed.
with open("lorem_ipsum.txt") as file:
    for line in file:
        print(line)
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets.
file.closed
True
Using syntax familiar from Chapter 3 to explain what the
with open(...) as file:
does above, we provide an alternative formulation with a try
statement below: The finally
-branch is always executed, even if an exception is raised inside the for
-loop. Therefore, file
is sure to be closed too. However, this formulation is somewhat less expressive.
# Open the file before the try statement so that file is defined in the finally-branch.
file = open("lorem_ipsum.txt")
try:
    for line in file:
        print(line)
finally:
    file.close()
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets.
file.closed
True
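To see that the with statement closes the file even when something goes wrong, here is a minimal sketch. It assumes a throwaway file created with the standard library's tempfile module, so it does not depend on lorem_ipsum.txt. The RuntimeError is raised on purpose and stands in for any error during processing.

```python
import os
import tempfile

# Create a small temporary file to work with (an assumption of this sketch).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

try:
    with open(path) as file:
        for line in file:
            # Simulate a failure while processing the first line.
            raise RuntimeError("something went wrong")
except RuntimeError:
    pass

# The file descriptor got closed although the block was left via an exception.
closed_after_error = file.closed
print(closed_after_error)  # True

# Clean up the temporary file.
os.remove(path)
```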
As an alternative to reading the contents of a file by looping over a TextIOWrapper
object, we may also call one of the methods they come with.
For example, the read() method takes an optional
size
argument of type int
and returns a str
object with at most the specified number of characters; without the argument, everything up to the end of the file is read.
file = open("lorem_ipsum.txt")
file.read(11)
'Lorem Ipsum'
When we call read() again, the returned
str
object begins where the previous one left off. This is because TextIOWrapper
objects like file
simply store a position at which the associated file on disk is being read. In other words, file
is like a cursor pointing into a file.
file.read(11)
' is simply '
In contrast, the readline() method keeps reading until it hits a newline character. These are shown in
str
objects as "\n"
.
file.readline()
'dummy text of the printing and typesetting industry.\n'
When we call readline() again, we obtain the next line.
file.readline()
"Lorem Ipsum has been the industry's standard dummy text ever since the 1500s\n"
Lastly, the readlines() method returns a
list
object that holds all lines in the file
from the current position to the end of the file. The latter position is often abbreviated as EOF in the documentation. Let's always remember that readlines() has the potential to crash a computer with a
MemoryError
.
file.readlines()
['when an unknown printer took a galley of type and scrambled it to make a type\n', 'specimen book. It has survived not only five centuries but also the leap into\n', 'electronic typesetting, remaining essentially unchanged. It was popularised in\n', 'the 1960s with the release of Letraset sheets.\n']
Calling readlines() a second time is as pointless as looping over
file
a second time.
file.readlines()
[]
file.close()
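The cursor behavior can be replayed without a file on disk. The sketch below uses io.StringIO from the standard library as an in-memory stand-in for a text file (an assumption made only to keep the example self-contained). The .seek() method is not covered in this section: it is used here merely to show that the cursor can be moved back to the beginning.

```python
import io

# An in-memory object that behaves like an opened text file.
fake_file = io.StringIO("Lorem Ipsum is simply\ndummy text of the\nprinting industry.\n")

first = fake_file.read(5)            # the first five characters
rest_of_line = fake_file.readline()  # continues where read() left off
remaining = fake_file.readlines()    # all lines from the cursor to EOF
exhausted = fake_file.readlines()    # nothing is left to read

fake_file.seek(0)                    # move the cursor back to the start
again = fake_file.read(5)

print(repr(first))         # 'Lorem'
print(repr(rest_of_line))  # ' Ipsum is simply\n'
print(remaining)           # ['dummy text of the\n', 'printing industry.\n']
print(exhausted)           # []
print(repr(again))         # 'Lorem'

fake_file.close()
```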
Because every line obtained by looping over a file or by calling readline() or readlines() ends with a "\n"
(with the possible exception of the last line in a file), and print() adds a newline of its own, we see empty lines printed between each line
in the for
-loops above. To print the entire text without empty lines in between, we pass an end=""
argument to the print() function.
with open("lorem_ipsum.txt") as file:
    for line in file:
        print(line, end="")
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets.
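An alternative is to remove the trailing newline from the data itself instead of changing how print() behaves. The following sketch uses the .rstrip() method that is shown further below, and again relies on io.StringIO as an in-memory stand-in for a real file (an assumption for the sake of a self-contained example).

```python
import io

# An in-memory stand-in for an opened text file.
fake_file = io.StringIO("first line\nsecond line\n")

cleaned_lines = []
for line in fake_file:
    # Strip only the newline character at the end of each line.
    cleaned_lines.append(line.rstrip("\n"))

print(cleaned_lines)  # ['first line', 'second line']
```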
A sequence is yet another abstract concept (cf., the "Containers vs. Iterables" section in Chapter 4 ).
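It unifies four orthogonal (i.e., "independent") concepts into one bigger idea: Any data type, such as
str
, is considered a sequence if it contains a finite number of other objects that can be iterated over in a predictable order. In other words, a sequence is a container that is also finite, iterable, and ordered.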
Chapter 7 formalizes these concepts in great detail. Here, we keep our focus on the
str
type that historically received its name as it models a string of characters . String is simply another term for sequence in the computer science literature.
Another example of a sequence is the list
type. Because of that, str
objects may be treated like list
objects in many situations.
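As a rough preview, and not the formal treatment promised for Chapter 7, the abstract base classes in the standard library's collections.abc module let us ask whether an object supports these concepts. Mapping Sized to "finite" is our own simplification for this sketch.

```python
from collections import abc

text = "Lorem ipsum dolor sit amet."
numbers = [1, 2, 3]

# Both str and list objects pass all of the checks below.
for obj in (text, numbers):
    print(
        isinstance(obj, abc.Container),  # supports the in operator
        isinstance(obj, abc.Sized),      # supports len()
        isinstance(obj, abc.Iterable),   # supports the for statement
        isinstance(obj, abc.Sequence),   # all of the above plus ordered indexing
    )

# A plain number is none of these.
number_is_sequence = isinstance(42, abc.Sequence)
print(number_is_sequence)  # False
```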
Below, the built-in len() function tells us how many characters make up
text
. len() would not work with an "infinite" object. As anything modeled in a program must fit into a computer's finite memory, there cannot exist truly infinite objects; however, Chapter 8
introduces specialized iterable data types that can be used to model an infinite series of "things" and that, consequently, have no concept of "length."
text
'Lorem ipsum dolor sit amet.'
len(text)
27
As str objects are iterable, we may loop over text
and do something with the individual characters, for example, print them out with extra space in between them. If it were not for the appropriately chosen name of the text
variable, we could not tell what concrete type of object the for
statement is looping over.
for character in text:
    print(character, end=" ")
L o r e m   i p s u m   d o l o r   s i t   a m e t .
With the reversed() built-in, we may loop over
text
in reverse order. Reversing text
only works as it has a forward order to begin with.
for character in reversed(text):
    print(character, end=" ")
. t e m a   t i s   r o l o d   m u s p i   m e r o L
As text is a container, we may check if a given str
object is contained in text
with the in
operator, which has two distinct usages: First, it checks if a single character is contained in a str
object. Second, it may also check if a shorter str
object, then called a substring, is contained in a longer one.
"L" in text
True
"ipsum" in text
True
"veni, vidi, vici" in text
False
As str
objects are ordered and finite, we may index into them to obtain individual characters with the indexing operator []
. This is analogous to how we obtained individual elements of a list
object in Chapter 1 .
text[0]
'L'
text[1]
'o'
The index must be of type int
; otherwise, we get a TypeError
.
text[1.0]
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[54], line 1 ----> 1 text[1.0] TypeError: string indices must be integers, not 'float'
The last index is one less than the above "length" of the str
object as we start counting at 0
.
text[26] # == text[len(text) - 1]
'.'
An IndexError
is raised whenever the index is out of range.
text[27] # == text[len(text)]
--------------------------------------------------------------------------- IndexError Traceback (most recent call last) Cell In[56], line 1 ----> 1 text[27] # == text[len(text)] IndexError: string index out of range
We may use negative indexes to start counting from the end of the str
object, as shown in the figure below. Note how this only works because sequences are finite.
Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Reverse | -27 | -26 | -25 | -24 | -23 | -22 | -21 | -20 | -19 | -18 | -17 | -16 | -15 | -14 | -13 | -12 | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 |
Character | L | o | r | e | m |   | i | p | s | u | m |   | d | o | l | o | r |   | s | i | t |   | a | m | e | t | . |
text[-1]
'.'
text[-27] # == text[-len(text)]
'L'
One reason why programmers like to start counting at 0
is that a positive index and the absolute value of its corresponding negative index always add up to the length of the sequence. Here, 6
and 21
add to 27
.
text[6]
'i'
text[-21]
'i'
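The relationship between the two kinds of indexes can be verified for every position at once. In this sketch, the corresponding negative index is computed by subtracting the length from the positive index. The variable names are our own.

```python
text = "Lorem ipsum dolor sit amet."
length = len(text)

for index in range(length):
    negative_index = index - length
    # Both indexes refer to the same character ...
    assert text[index] == text[negative_index]
    # ... and their absolute values add up to the length.
    assert index + abs(negative_index) == length

print(text[6], text[6 - length])  # i i
```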
A slice is a substring of a str
object.
The slicing operator is a generalization of the indexing operator: We put one, two, or three integers within the brackets []
, separated by colons :
. The three integers are then referred to as the start, stop, and step values.
Let's start with two integers, start and stop. Whereas the character at the start position is included in the returned str
object, the one at the stop position is not. If both start and stop are positive, the difference "stop minus start" tells us how many characters the resulting slice has. So, below, 5 - 0 == 5
implies that "Lorem"
consists of 5
characters. So, colloquially speaking, text[0:5]
means "taking the first 5 - 0 == 5
characters of text
."
text[0:5]
'Lorem'
text[12:len(text)]
'dolor sit amet.'
If left out, start defaults to 0
and stop to the length of the str
object (i.e., the end).
text[:5]
'Lorem'
text[12:]
'dolor sit amet.'
Not including the character at the stop position makes working with individual slices easier as they add up to the original str
object again (cf., the "String Operations" section below regarding the overloaded +
operator).
text[:5] + text[5:]
'Lorem ipsum dolor sit amet.'
Slicing and indexing make it easy to obtain shorter versions of the original str
object. A common application would be to parse out meaningful substrings from raw text data.
text[:11] + text[-10:]
'Lorem ipsum sit amet.'
By combining a positive start with a negative stop index, we specify both ends of the slice relative to the ends of the entire str
object. So, colloquially speaking, [6:-10]
below means "drop the first six and last ten characters." The length of the resulting slice can then not be calculated from the indexes alone as it also depends on the length of the original str
object!
text[6:-10]
'ipsum dolor'
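To make the arithmetic explicit, the following sketch translates the negative stop into its non-negative equivalent by adding the length of text. Once both values are non-negative, "stop minus start" gives the length of the slice again.

```python
text = "Lorem ipsum dolor sit amet."
start, stop = 6, -10

middle = text[start:stop]

# A negative stop counts from the end, so we add the length to translate it.
equivalent_stop = len(text) + stop
print(equivalent_stop)  # 17

print(middle == text[start:equivalent_stop])   # True
print(len(middle) == equivalent_stop - start)  # True
```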
For convenience, the indexes do not need to lie within the range from 0
to len(text)
when slicing. So, no IndexError
is raised here.
text[-999:999]
'Lorem ipsum dolor sit amet.'
By leaving out both start and stop, we take a "full" slice that is essentially a copy of the original str
object.
text[:]
'Lorem ipsum dolor sit amet.'
A step value of i
can be used to obtain only every i
th character.
text[::2]
'Lrmismdlrstae.'
A negative step size of -1
reverses the order of the characters.
text[::-1]
'.tema tis rolod muspi meroL'
Whereas elements of a list
object may be re-assigned, as shortly hinted at in Chapter 1 , this is not allowed for the individual characters of
str
objects. Once created, they cannot be changed. Formally, we say that str
objects are immutable. In that regard, they are like the numeric types in Chapter 5 .
In contrast, objects that may be changed after creation are called mutable. We already saw in Chapter 1 how mutable objects are more difficult to reason about for a beginner, in particular, if more than one variable references the same object. Yet, mutability does have its place in a programmer's toolbox, and we revisit this idea in the next chapters.
The TypeError
indicates that str
objects are immutable: Assignment to an index or a slice is not supported.
text[0] = "X"
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[72], line 1 ----> 1 text[0] = "X" TypeError: 'str' object does not support item assignment
text[:5] = "random"
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[73], line 1 ----> 1 text[:5] = "random" TypeError: 'str' object does not support item assignment
Objects of type str
come with many methods bound on them (cf., the documentation for a full list). As seen before, they work like normal functions and are accessed via the dot operator
.
. Calling a method is also referred to as method invocation.
The .find() method returns the index of the first occurrence of a character or a substring. If no match is found, it returns
-1
. A mirrored version searching from the right called .rfind() exists as well. The .index()
and .rindex()
methods work in the same way but raise a
ValueError
if no match is found. So, we can control if a search fails silently or loudly.
text
'Lorem ipsum dolor sit amet.'
text.find("a")
22
text.find("b")
-1
text.find("dolor")
12
.find() takes optional start and end arguments that allow us to find occurrences other than the first one.
text.find("o")
1
text.find("o", 2)
13
text.find("o", 2, 12)
-1
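Building on the start argument, a short loop can collect the indexes of all occurrences. This is only one possible sketch: after each match, the search restarts one position to the right of it until .find() returns -1.

```python
text = "Lorem ipsum dolor sit amet."

positions = []
index = text.find("o")
while index != -1:
    positions.append(index)
    # Continue searching right after the previous match.
    index = text.find("o", index + 1)

print(positions)  # [1, 13, 15]
```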
The .count() method does what we expect.
text
'Lorem ipsum dolor sit amet.'
text.count("l")
1
text.lower().count("l")
2
Alternatively, we can use the .upper() method and search for
"L"
s.
text.upper().count("L")
2
example = "random"
id(example)
140667840152112
lower = example.lower()
id(lower)
140667764453680
example
and lower
are different objects with the same value.
example is lower
False
example == lower
True
Besides .upper() and .lower()
there exist also .title()
and .swapcase()
methods.
text.lower()
'lorem ipsum dolor sit amet.'
text.upper()
'LOREM IPSUM DOLOR SIT AMET.'
text.title()
'Lorem Ipsum Dolor Sit Amet.'
text.swapcase()
'lOREM IPSUM DOLOR SIT AMET.'
Another popular string method is .split() : It separates a longer
str
object into smaller ones collected in a list
object. By default, groups of contiguous whitespace characters are used as the separator.
As an example, we use .split() to print out the individual words in
text
with more whitespace in between them.
text.split()
['Lorem', 'ipsum', 'dolor', 'sit', 'amet.']
for word in text.split():
    print(word, end=" ")
Lorem ipsum dolor sit amet.
The opposite of splitting is done with the .join() method. It is typically invoked on a
str
object that represents a separator (e.g., " "
or ", "
) and connects the elements provided by an iterable argument (e.g., words
below) into one new str
object.
words = ["This", "will", "become", "a", "sentence."]
sentence = " ".join(words)
sentence
'This will become a sentence.'
As the str
object "abcde"
below is an iterable itself, its characters (!) are joined together with a space " "
in between.
" ".join("abcde")
'a b c d e'
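Combining both methods gives a common idiom for cleaning up irregular whitespace: splitting without arguments discards every run of whitespace, and joining with a single space puts the words back together. The messy example string below is made up for illustration.

```python
messy = "  Lorem   ipsum \n dolor\tsit   amet.  "

# Splitting on groups of whitespace also drops leading and trailing whitespace.
words = messy.split()
print(words)  # ['Lorem', 'ipsum', 'dolor', 'sit', 'amet.']

normalized = " ".join(words)
print(normalized)  # Lorem ipsum dolor sit amet.
```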
The .replace() method creates a new
str
object with parts of the original str
object potentially replaced.
sentence.replace("will become", "is")
'This is a sentence.'
Note how sentence
itself remains unchanged. Bound to an immutable object, .replace() must create new objects.
sentence
'This will become a sentence.'
" text with whitespace ".strip()
'text with whitespace'
" text with whitespace ".lstrip()
'text with whitespace '
" text with whitespace ".rstrip()
' text with whitespace'
sentence.ljust(40)
'This will become a sentence.            '
sentence.rjust(40)
'            This will become a sentence.'
Similarly, the .zfill() method can be used to pad a
str
representation of a number with leading 0
s for justified output.
"42.87".zfill(10)
'0000042.87'
"-42.87".zfill(10)
'-000042.87'
As mentioned in Chapter 1 , the
+
and *
operators are overloaded and used for string concatenation. They always create new str
objects. That has nothing to do with the str
type's immutability, but is the default behavior of operators.
"Hello " + text[:4]
'Hello Lore'
5 * text[:12] + "..."
'Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum Lorem ipsum ...'
The relational operators also work with str
objects, another example of operator overloading. Comparison is done one character at a time in a pairwise fashion until the first pair differs or one operand ends. However, str
objects are sorted in a "weird" way. For example, all upper case characters come before all lower case characters. The reason for that is given in the "Characters are Numbers with a Convention" sub-section in the second part of this chapter.
"Apple" < "Banana"
True
"apple" < "Banana"
False
"apple" < "Banana".lower()
True
Below is an example with typical German last names that shows how characters other than the first decide the ordering.
"Mai" < "Maier" < "Mayer" < "Meier" < "Meyer"
True
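The same ordering rules apply when str objects are sorted. The sketch below uses the built-in sorted() function, which is not introduced in this section, together with its key argument to compare lowercased versions of the words. The list of names is made up for illustration.

```python
names = ["banana", "Apple", "cherry", "Banana", "apple"]

# By default, all upper case characters come before all lower case ones.
default_order = sorted(names)
print(default_order)  # ['Apple', 'Banana', 'apple', 'banana', 'cherry']

# Comparing lowercased versions yields a case-insensitive order.
# Words that are equal when lowercased keep their original relative order.
case_insensitive = sorted(names, key=str.lower)
print(case_insensitive)  # ['Apple', 'apple', 'banana', 'Banana', 'cherry']
```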
Often, we want to use str
objects as drafts in the source code that are filled in with concrete text only at runtime. This approach is called string interpolation. There are three ways to do that in Python.
Formatted string literals , or f-strings for short, are the most recently added (cf., PEP 498
in 2016) and most readable way: We simply prepend a
str
in its literal notation with an f
, and put variables, or more generally, expressions, within curly braces {}
. These are then filled in when the string literal is evaluated.
name = "Alexander"
time_of_day = "morning"
f"Hello {name}! Good {time_of_day}."
'Hello Alexander! Good morning.'
Separated by a colon :
, various formatting options are available. In the beginning, the ability to round numbers for output may be particularly useful: This can be achieved by adding :.2f
to the variable name inside the curly braces, which formats the number like a float
with exactly two decimal places, rounding as necessary. The :.2f
is a so-called format specifier, and there exists a whole format specification mini-language to govern how specifiers work.
pi = 3.141592653
f"Pi is {pi:.2f}"
'Pi is 3.14'
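A few more specifiers from the mini-language are sketched below. The selection is ours and far from exhaustive. Note that formatting only affects the resulting text: the number itself is not changed.

```python
pi = 3.141592653

four_digits = f"{pi:.4f}"      # four decimal places
padded = f"{pi:10.2f}"         # right-aligned in a field of width 10
grouped = f"{1234567.891:,.1f}"  # thousands separators
percent = f"{0.4287:.1%}"      # multiplied by 100 and shown with a % sign

print(repr(four_digits))  # '3.1416'
print(repr(padded))       # '      3.14'
print(repr(grouped))      # '1,234,567.9'
print(repr(percent))      # '42.9%'

# The variable still holds the unrounded value.
print(pi)  # 3.141592653
```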
str
objects also provide a .format() method that accepts an arbitrary number of positional arguments that are inserted into the
str
object in the same order replacing empty curly brackets {}
. String interpolation with the .format() method is a more traditional and probably the most common way as of today. While f-strings are the recommended way going forward, usage of the .format()
method is likely not declining any time soon.
"Hello {}! Good {}.".format(name, time_of_day)
'Hello Alexander! Good morning.'
We may use index numbers inside the curly braces if the order is different in the str
object.
"Good {1}, {0}".format(name, time_of_day)
'Good morning, Alexander'
The .format() method may alternatively be used with keyword arguments as well. Then, we must put the keywords' names within the curly brackets.
"Hello {name}! Good {time}.".format(name=name, time=time_of_day)
'Hello Alexander! Good morning.'
Format specifiers work as in the f-string case.
"Pi is {:.2f}".format(pi)
'Pi is 3.14'
% Operator

The %
operator that we saw in the context of modulo division in Chapter 1 is overloaded with string interpolation when its first operand is a
str
object. The second operand consists of all expressions to be filled in. Format specifiers work with a %
instead of curly braces and according to a different set of rules referred to as printf-style string formatting . So,
{:.2f}
becomes %.2f
.
This way of string interpolation is the oldest and originates from the C language . It is still widespread, but we should use one of the other two ways instead. We show it here mainly for completeness' sake.
"Pi is %.2f" % pi
'Pi is 3.14'
To insert more than one expression, we must list them in order and between parentheses (
and )
. As Chapter 7 reveals, this literal syntax creates an object of type
tuple
. Also, to format an expression as text, we use the format specifier %s
.
"Hello %s! Good %s." % (name, time_of_day)
'Hello Alexander! Good morning.'
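To compare the three ways side by side, the sketch below builds the same text with each of them. The sentence itself is made up, while name and pi are the variables from above.

```python
name = "Alexander"
pi = 3.141592653

# The same text created in three different ways.
with_f_string = f"Hello {name}! Pi is {pi:.2f}."
with_format = "Hello {}! Pi is {:.2f}.".format(name, pi)
with_percent = "Hello %s! Pi is %.2f." % (name, pi)

print(with_f_string)  # Hello Alexander! Pi is 3.14.
print(with_f_string == with_format == with_percent)  # True
```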