Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud.
In this second part of the chapter, we look in more detail at how str
objects work in memory, in particular how the 0s and 1s in the memory translate into characters.
As previously seen, some characters have a special meaning when following the escape character "\"
. Besides escaping the kind of quote used as the str
object's delimiter, '
or "
, most of these escape sequences (i.e., "\"
with the subsequent character) act as control characters that move the "cursor" in the output without generating any pixel on the screen. Because of that, we only see the effect of such escape sequences when used with the print() function. The documentation
lists all available escape sequences, of which we show the most important ones below.
The most common escape sequence is "\n"
, which "prints" a newline character, also called the line feed character, or LF for short.
"This is a sentence\nthat is printed\non three lines."
'This is a sentence\nthat is printed\non three lines.'
print("This is a sentence\nthat is printed\non three lines.")
This is a sentence
that is printed
on three lines.
"\b"
is the backspace character, or BS for short, that moves the cursor back by one character.
print("ABC\bX")
ABX
print("ABC\bXY")
ABXY
Similarly, "\r"
is the carriage return character, or CR for short, that moves the cursor back to the beginning of the line.
print("ABC\rX")
XBC
print("ABC\rXY")
XYC
While Linux and modern macOS systems solely use "\n"
to express a new line, Windows systems default to using "\r\n"
. This may lead to "weird" bugs in software projects where people using both kinds of operating systems collaborate.
print("This is a sentence\r\nthat is printed\r\non three lines.")
This is a sentence
that is printed
on three lines.
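Because both conventions occur in practice, it can help to split text in a way that does not depend on the line ending. The following is a minimal sketch using the built-in splitlines() method, which treats "\n" and "\r\n" alike. The sample texts and variable names are made up for illustration.

```python
# Two hypothetical texts that differ only in their line endings.
unix_text = "first\nsecond\nthird"
windows_text = "first\r\nsecond\r\nthird"

# Splitting on "\n" alone leaves a stray "\r" at the end of the Windows lines.
naive = windows_text.split("\n")

# splitlines() recognizes both conventions and drops the line endings.
portable = windows_text.splitlines()

print(naive)
print(portable)
print(portable == unix_text.splitlines())
```

So, code that processes text line by line is usually more robust with splitlines() than with split("\n").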
"\t"
makes the cursor "jump" to the next of several equidistant tab stops. That may be useful for formatting the output of a program with lengthy, tabular results.
print("Jump\tfrom\ttab\tstop\tto\ttab\tstop.\nThe\tsecond\tline\tdoes\tso\ttoo.")
Jump    from    tab     stop    to      tab     stop.
The     second  line    does    so      too.
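How far apart the tab stops are depends on the program that displays the text. As a sketch of the idea, the built-in expandtabs() method replaces every tab with as many spaces as are needed to reach the next tab stop. The variable name row is an assumption made for this example.

```python
row = "Jump\tfrom\ttab"

# expandtabs() replaces every tab with spaces up to the next tab stop.
# By default, the tab stops are 8 characters apart.
print(row.expandtabs())

# The distance between the tab stops may be passed in as an argument.
print(row.expandtabs(6))
```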
Sometimes we do not want the backslash "\"
and its subsequent character to be interpreted as an escape sequence. For example, let's print a typical installation path on a Windows system. Obviously, the newline character "\n"
does not make sense here.
print("C:\Programs\new_application")
C:\Programs
ew_application
<>:1: SyntaxWarning: invalid escape sequence '\P'
<>:1: SyntaxWarning: invalid escape sequence '\P'
/tmp/ipykernel_159416/1102122489.py:1: SyntaxWarning: invalid escape sequence '\P'
  print("C:\Programs\new_application")
Some str
objects even produce a SyntaxError
because the "\U"
is not followed by eight hexadecimal digits and, thus, cannot be interpreted as a Unicode code point (cf., next section).
print("C:\Users\Administrator\Desktop\Project")
  Cell In[10], line 1
    print("C:\Users\Administrator\Desktop\Project")
          ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
A simple solution would be to escape the escape character with a second backslash "\"
.
print("C:\\Programs\\new_application")
C:\Programs\new_application
print("C:\\Users\\Administrator\\Desktop\\Project")
C:\Users\Administrator\Desktop\Project
However, this is tedious to remember and type. For such use cases, Python allows us to prefix any string literal with an r
. The literal is then interpreted in a "raw" way.
print(r"C:\Programs\new_application")
C:\Programs\new_application
print(r"C:\Users\Administrator\Desktop\Project")
C:\Users\Administrator\Desktop\Project
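A quick way to convince ourselves that the r prefix only changes how the literal is read, and not the type of the resulting object, is to compare both notations. The sketch below does that; the variable names are chosen for illustration only.

```python
escaped = "C:\\Programs\\new_application"
raw = r"C:\Programs\new_application"

# Both literals create the very same text.
print(escaped == raw, type(raw))

# Inside a raw literal, the backslash and the "n" remain two characters.
print(len("\n"), len(r"\n"))
```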
So far, we used the term character without any further consideration. In this section, we briefly look into what characters are and how they are modeled in software.
Chapter 5 gives us an idea of how individual bits are used to express all types of numbers, from "simple"
int
objects to "complex" float
ones. To model characters, another layer of abstraction is put on top of whole numbers. So, just as bits are used to express integers, integers in turn are used to express characters.
Many conventions have been developed as to what integer is associated with which character. The most basic one that was also adopted around the world is the so-called American Standard Code for Information Interchange, or ASCII for short. It uses 7 bits of information to map the unprintable control characters as well as the printable letters of the alphabet, numbers, and common symbols to the numbers
0
through 127
.
A mapping from characters to numbers is referred to by the technical term encoding. We may use the built-in ord() function to encode any single character. The inverse to that is the built-in chr()
function, which decodes a number into a character.
ord("A")
65
chr(65)
'A'
Of course, unprintable escape sequences like "\n"
count as only one character.
ord("\n")
10
chr(10)
'\n'
In ASCII, the numbers 0
through 31
(and 127
) are mapped to all kinds of unprintable control characters. The decimal digits are encoded with the numbers 48
through 57
, the upper case letters with 65
through 90
, and the lower case letters with 97
through 122
. While this seems random at first, there is of course a "sophisticated" system behind it. That can immediately be seen when looking at the encoded numbers in their binary representations.
For example, the digit 5
is mapped to the number 53
in ASCII. The binary representation of 53
is 0b_11_0101
and the least significant four bits, 0101
, mean 5. Similarly, the letter "E"
is the fifth letter in the alphabet. It is encoded with the number 69
in ASCII, which is 0b_100_0101
in binary. And, the least significant bits, 0_0101
, mean 5. Analogously, "e"
is encoded with 101
in ASCII, which is 0b_110_0101
in binary. And, the least significant bits, 0_0101
, mean 5 again. This encoding was chosen mainly because programmers "in the old days" needed to implement these encodings "by hand." Python abstracts that logic away from its users.
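The bit patterns described above can be checked with Python's bitwise operators. This is only a sketch to illustrate the layout of the ASCII table; in everyday code, we would rather use methods like upper() or lower().

```python
# Mask out the four least significant bits of an encoded digit.
print(ord("5") & 0b_1111)

# Mask out the five least significant bits of an encoded letter.
print(ord("E") & 0b_1_1111, ord("e") & 0b_1_1111)

# Upper and lower case letters differ only in the bit that is worth 32.
lower = chr(ord("E") | 0b_10_0000)
upper = chr(ord("e") & ~0b_10_0000)
print(lower, upper)
```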
This encoding scheme is also the cause for the "weird" sorting in the "String Comparison" section in the first part of this chapter, where
"apple"
comes after "Banana"
. As "a"
is encoded with 97
and "B"
with 66
, "Banana"
must of course be "smaller" than "apple"
when comparison is done in a pairwise fashion of the individual characters.
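The effect can be reproduced with the built-in sorted() function. The sketch below also shows one possible way around it, namely passing str.casefold as the key argument; the list of fruits is made up for this example.

```python
fruits = ["apple", "Banana", "cherry"]

# The default order compares the numbers behind the characters.
by_code_point = sorted(fruits)
print(by_code_point)

# A key function makes the comparison ignore the case instead.
ignoring_case = sorted(fruits, key=str.casefold)
print(ignoring_case)
```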
for number in range(48, 58):
    print(number, bin(number), "-> ", chr(number))
48 0b110000 ->  0
49 0b110001 ->  1
50 0b110010 ->  2
51 0b110011 ->  3
52 0b110100 ->  4
53 0b110101 ->  5
54 0b110110 ->  6
55 0b110111 ->  7
56 0b111000 ->  8
57 0b111001 ->  9
for i, number in enumerate(range(65, 91), start=1):
    end = "\n" if i % 3 == 0 else "\t"
    print(number, bin(number), "-> ", chr(number), end=end)
65 0b1000001 ->  A    66 0b1000010 ->  B    67 0b1000011 ->  C
68 0b1000100 ->  D    69 0b1000101 ->  E    70 0b1000110 ->  F
71 0b1000111 ->  G    72 0b1001000 ->  H    73 0b1001001 ->  I
74 0b1001010 ->  J    75 0b1001011 ->  K    76 0b1001100 ->  L
77 0b1001101 ->  M    78 0b1001110 ->  N    79 0b1001111 ->  O
80 0b1010000 ->  P    81 0b1010001 ->  Q    82 0b1010010 ->  R
83 0b1010011 ->  S    84 0b1010100 ->  T    85 0b1010101 ->  U
86 0b1010110 ->  V    87 0b1010111 ->  W    88 0b1011000 ->  X
89 0b1011001 ->  Y    90 0b1011010 ->  Z
for i, number in enumerate(range(97, 123), start=1):
    end = "\n" if i % 3 == 0 else "\t"
    print(str(number).rjust(3), bin(number), "-> ", chr(number), end=end)
 97 0b1100001 ->  a     98 0b1100010 ->  b     99 0b1100011 ->  c
100 0b1100100 ->  d    101 0b1100101 ->  e    102 0b1100110 ->  f
103 0b1100111 ->  g    104 0b1101000 ->  h    105 0b1101001 ->  i
106 0b1101010 ->  j    107 0b1101011 ->  k    108 0b1101100 ->  l
109 0b1101101 ->  m    110 0b1101110 ->  n    111 0b1101111 ->  o
112 0b1110000 ->  p    113 0b1110001 ->  q    114 0b1110010 ->  r
115 0b1110011 ->  s    116 0b1110100 ->  t    117 0b1110101 ->  u
118 0b1110110 ->  v    119 0b1110111 ->  w    120 0b1111000 ->  x
121 0b1111001 ->  y    122 0b1111010 ->  z
The remaining symbols
encoded in ASCII are encoded with the numbers still unused, which is why they are scattered.
symbols = (
    list(range(32, 48))
    + list(range(58, 65))
    + list(range(91, 97))
    + list(range(123, 127))
)
for i, number in enumerate(symbols, start=1):
    end = "\n" if i % 3 == 0 else "\t"
    print(str(number).rjust(3), bin(number).rjust(10), "-> ", chr(number), end=end)
 32   0b100000 ->          33   0b100001 ->  !     34   0b100010 ->  "
 35   0b100011 ->  #     36   0b100100 ->  $     37   0b100101 ->  %
 38   0b100110 ->  &     39   0b100111 ->  '     40   0b101000 ->  (
 41   0b101001 ->  )     42   0b101010 ->  *     43   0b101011 ->  +
 44   0b101100 ->  ,     45   0b101101 ->  -     46   0b101110 ->  .
 47   0b101111 ->  /     58   0b111010 ->  :     59   0b111011 ->  ;
 60   0b111100 ->  <     61   0b111101 ->  =     62   0b111110 ->  >
 63   0b111111 ->  ?     64  0b1000000 ->  @     91  0b1011011 ->  [
 92  0b1011100 ->  \     93  0b1011101 ->  ]     94  0b1011110 ->  ^
 95  0b1011111 ->  _     96  0b1100000 ->  `    123  0b1111011 ->  {
124  0b1111100 ->  |    125  0b1111101 ->  }    126  0b1111110 ->  ~
As the ASCII character set does not work for many languages other than English, various other encodings were developed. Popular examples are ISO 8859-1 for western European languages or Windows-1250
for central and eastern European ones that use the Latin script. Many of these encodings use 8-bit numbers (i.e.,
0
through 255
) to map the multitude of non-English letters (e.g., the German umlauts
"ä"
, "ö"
, "ü"
, or "ß"
).
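The downside of having many 8-bit encodings is that one and the same number stands for different letters depending on the encoding. The sketch below illustrates that with the number 0xE8, using "iso-8859-1" and "cp1250", the names under which Python knows the two encodings mentioned above. It relies on the bytes type and its decode() method, both of which are covered at the end of this chapter.

```python
raw_byte = bytes([0xE8])

# The same number stands for different letters in different encodings.
western = raw_byte.decode("iso-8859-1")
central = raw_byte.decode("cp1250")
print(western, central)

# The German umlaut "ä" fits into a single byte in ISO 8859-1.
print("ä".encode("iso-8859-1"))
```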
However, none of these specialized encodings can map all characters of all languages around the world from all times in human history. To achieve that, a truly global standard called Unicode was developed and its first version released in 1991. Since then, Unicode has been amended with many other "characters," the most popular among them being emojis
or the Klingon
language (from the science fiction series Star Trek
). In Unicode, every character is given an identity referred to as the code point. Code points are hexadecimal numbers from
0x0000
through 0x10ffff
, written as U+0000 and U+10FFFF outside of Python. Consequently, there exist at most 1,114,112 code points, of which only about 10% are currently in use, allowing lots of room for new characters to be invented. The first 128
code points are identical to the ASCII encoding for reasons explained in the "The bytes
Type" section further below. There exist plenty of lists of all Unicode characters on the web (e.g., Wikipedia).
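The bounds mentioned above can be verified from within Python. The following sketch uses sys.maxunicode from the standard library together with the ord() and chr() built-ins introduced earlier.

```python
import sys

# The largest code point Python supports is the upper bound named above.
print(hex(sys.maxunicode), sys.maxunicode + 1)

# ASCII characters keep their numbers in Unicode.
print(ord("A"), chr(65))

# Numbers beyond the valid range are rejected.
try:
    chr(0x110000)
except ValueError as error:
    outcome = "ValueError"
    print("ValueError:", error)
```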
All we need to know to print a character is its code point. Python uses the escape sequence "\U"
that is followed by eight hexadecimal digits. Underscore separators are unfortunately not allowed here.
So, to print a smiley, we just need to look up the corresponding number (e.g., here).
"\U0001f604"
'😄'
Every Unicode character also has a descriptive name that we can use with the escape sequence "\N"
and within curly braces {}
.
"\N{FACE WITH TEARS OF JOY}"
'😂'
Whenever the code point can be expressed with just four hexadecimal digits, we may use the escape sequence "\u"
for brevity.
"\U00000041" # hex(65) == 0x41
'A'
"\u0041"
'A'
Analogously, if the code point can be expressed with two hexadecimal digits, we may use the escape sequence "\x"
for even more concise code.
"\x41"
'A'
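All of these notations are merely different spellings of the same character in source code. As a sketch, the comparison below confirms that, and the unicodedata module in the standard library translates between characters and their official names.

```python
import unicodedata

# All of these literals denote the same one-character text.
same = "\x41" == "\u0041" == "\U00000041" == "\N{LATIN CAPITAL LETTER A}" == "A"
print(same)

# Look up a character's official name or a character by its name.
print(unicodedata.name("A"))
print(unicodedata.lookup("SNAKE") == "\U0001f40d")
```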
As the str
type is based on Unicode, a str
object's behavior is more in line with how humans view text and not how it is expressed in source code.
For example, while it is obvious that len("A")
evaluates to 1
, ...
len("A")
1
... what should len("\N{SNAKE}")
evaluate to? As the idea of a snake is expressed as one "character," len() also returns
1
here.
"\N{SNAKE}"
'🐍'
len("\N{SNAKE}")
1
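More precisely, len() counts code points. That is neither the number of bytes needed to store the text nor, in every case, the number of symbols a human would count. The sketch below illustrates both points; it assumes the UTF-8 encoding for the byte count and uses unicodedata.normalize() from the standard library, which goes slightly beyond what this chapter covers.

```python
import unicodedata

snake = "\N{SNAKE}"

# One code point, but four bytes once encoded as UTF-8.
print(len(snake), len(snake.encode("utf-8")))

# An "é" may be stored as one code point or as "e" plus a combining accent.
composed = "\u00e9"
decomposed = "e\u0301"
print(len(composed), len(decomposed), composed == decomposed)

# Normalizing both to the same form makes them comparable.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
```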
Many of the built-in str
methods also consider Unicode. For example, in contrast to lower(), the casefold()
method knows that the German
"ß"
is commonly converted to "ss"
. So, when searching for exact matches, normalizing text with casefold() may yield better results than with lower()
.
"Straße".lower()
'straße'
"Straße".casefold()
'strasse'
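As a sketch of why that matters when matching text, consider a search term typed in upper case. The names query and entry are made up for this example.

```python
query = "STRASSE"
entry = "Straße"

# lower() keeps the "ß", so the two texts do not match.
print(query.lower() == entry.lower())

# casefold() converts the "ß" to "ss", so they do.
print(query.casefold() == entry.casefold())
```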
Many other methods like isdecimal(), isdigit()
, isnumeric()
, isprintable()
, isidentifier()
, and many more may be worthwhile to know for the data science practitioner, especially when it comes to data cleaning.
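The differences between some of these methods are subtle. The sketch below compares three of them on a few hand-picked samples, namely the digit 5, the superscript two, the fraction one half, and an ordinary word, and then tries out isidentifier() and isprintable().

```python
samples = ["5", "\u00b2", "\u00bd", "five"]

results = {}
for text in samples:
    results[text] = (text.isdecimal(), text.isdigit(), text.isnumeric())
    print(repr(text), *results[text])

# Identifiers must not start with a digit.
print("new_name".isidentifier(), "2nd_name".isidentifier())

# Control characters like the newline are not printable.
print("\n".isprintable())
```

For data cleaning, that means the choice of method decides whether text like "²" passes as a number or not.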
Sometimes, it is convenient to split text across multiple lines in source code, for example, to make lines fit into the 79-character limit of PEP 8 or because the text consists of many lines and typing out
"\n"
is tedious. However, using just one pair of double quotes "
around multiple lines results in a SyntaxError
.
"
Do not break the lines like this
"
  Cell In[34], line 1
    "
    ^
SyntaxError: unterminated string literal (detected at line 1)
Instead, we may enclose a string literal with either triple double quotes """
or triple single quotes '''
. Then, newline characters in the source code are converted into "\n"
characters in the resulting str
object. Docstrings are precisely that, and, by convention, always written within triple double quotes """
.
multi_line = """
I am a multi-line string
consisting of four lines.
"""
A caveat is that "\n"
characters are often inserted at the beginning or end of the text when we try to format the source code nicely.
multi_line
'\nI am a multi-line string\nconsisting of four lines.\n'
print(multi_line)

I am a multi-line string
consisting of four lines.

Using the split() method with the optional
sep
argument, we confirm that multi_line
consists of four lines with the first and last line being empty.
for i, line in enumerate(multi_line.split("\n"), start=1):
    print(i, line)
1 
2 I am a multi-line string
3 consisting of four lines.
4 
To mitigate that, we often see the strip() method in source code.
multi_line = """
I am a multi-line string
consisting of two lines.
""".strip()
for i, line in enumerate(multi_line.split("\n"), start=1):
    print(i, line)
1 I am a multi-line string
2 consisting of two lines.
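When a multi-line string is defined inside an indented block, the indentation of the source code ends up in the text as well. One possible remedy, sketched below, is the textwrap.dedent() function in the standard library. The function describe() is a hypothetical example and not part of this chapter's code.

```python
import textwrap


def describe():
    # A backslash right after the opening quotes suppresses the first newline.
    text = """\
        I am a multi-line string
        consisting of two lines.
    """
    # dedent() removes the common leading whitespace, strip() the final newline.
    return textwrap.dedent(text).strip()


for i, line in enumerate(describe().split("\n"), start=1):
    print(i, line)
```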
The bytes Type

To end this chapter, we want to briefly look at the bytes
data type, which conceptually is a sequence of bytes. That data format is probably one of the most generic ways of exchanging data between any two programs or computers (e.g., a web browser obtains its data from a web server in this format).
Let's open a binary file in read-only mode (i.e., mode="rb"
) and read in all of its contents.
with open("full_house.bin", mode="rb") as binary_file:
    data = binary_file.read()
data
is an object of type bytes
.
id(data)
139880714782512
type(data)
bytes
Its value is shown in the literal bytes notation with a b
prefix (cf., the reference). Here, every byte is expressed in hexadecimal representation with the escape sequence
"\x"
. This representation is commonly chosen as we cannot tell what kind of information is hidden in the data
by just looking at the bytes. Instead, we must be told by some other source how to decode the raw bytes into information we can interpret.
data
b'\xf0\x9f\x82\xa7\xf0\x9f\x82\xb7\xf0\x9f\x83\x97\xf0\x9f\x83\x8e\xf0\x9f\x83\x9e'
bytes
objects work like str
objects in many ways. In particular, they are sequences as well: The number of bytes is finite and we may iterate over them in order.
len(data)
20
Consisting of 8 bits, a single byte can always be interpreted as a whole number from 0
through 255
. That is exactly what we see when we loop over the data
...
for byte in data:
    print(byte, end=" ")
240 159 130 167 240 159 130 183 240 159 131 151 240 159 131 142 240 159 131 158
... or index into it.
data[-1]
158
Slicing returns another bytes
object.
data[::2]
b'\xf0\x82\xf0\x82\xf0\x83\xf0\x83\xf0\x83'
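The difference between indexing and slicing is worth a second look: the former yields an int object, whereas the latter yields a bytes object, even for a single byte. The sketch below uses the first four bytes from the data above, written out as a literal so that it runs without the file; the name first_card is an assumption.

```python
# The first four bytes of the data shown above, written as a bytes literal.
first_card = b"\xf0\x9f\x82\xa7"

# Indexing yields an int, whereas slicing yields a bytes object.
print(first_card[0], first_card[:1])

# A bytes object can be rebuilt from its whole numbers.
rebuilt = bytes([240, 159, 130, 167])
print(rebuilt == first_card)
```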
To obtain a str object from a bytes object, we decode it with the bytes type's decode() method.
cards = data.decode()
type(cards)
str
So, data
consisted of a full house hand in a poker game.
cards
'🂧🂷🃗🃎🃞'
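The decoding step can be retraced in smaller pieces. The sketch below repeats the bytes from above as a literal and passes the encoding explicitly; "utf-8" is also what decode() assumes when called without an argument.

```python
data = (
    b"\xf0\x9f\x82\xa7\xf0\x9f\x82\xb7\xf0\x9f\x83\x97"
    b"\xf0\x9f\x83\x8e\xf0\x9f\x83\x9e"
)

# 20 bytes turn into 5 characters.
cards = data.decode("utf-8")
print(len(data), len(cards))

# Every playing card occupies four bytes in UTF-8.
first = data[:4].decode("utf-8")
print(first, hex(ord(first)))
```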
To go in the opposite direction and encode a given str
object, we use the str
type's encode() method.
place = "Café Kastanientörtchen"
place.encode()
b'Caf\xc3\xa9 Kastanient\xc3\xb6rtchen'
place.encode("iso-8859-1")
b'Caf\xe9 Kastanient\xf6rtchen'
However, we must use the same encoding for the decoding step as for the encoding step. Otherwise, a UnicodeDecodeError
may be raised or, worse, garbled text may be returned without any error.
place.encode("iso-8859-1").decode()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[55], line 1
----> 1 place.encode("iso-8859-1").decode()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte
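A mismatch between the two steps does not always end in an exception. The sketch below shows the opposite mix-up, which silently produces garbled text, as well as the optional errors argument of decode(), which substitutes a replacement character for every undecodable byte instead of raising an error. The variable names are chosen for illustration.

```python
place = "Café Kastanientörtchen"

# Decoding UTF-8 bytes as ISO 8859-1 raises no error but garbles the text.
garbled = place.encode("utf-8").decode("iso-8859-1")
print(garbled)

# The errors argument substitutes undecodable bytes instead of raising.
latin_bytes = place.encode("iso-8859-1")
patched = latin_bytes.decode("utf-8", errors="replace")
print(patched)
```

Either way, the original text is lost, which is why knowing the correct encoding remains the only real fix.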
Not all encodings map all Unicode code points. For example, "iso-8859-1"
does not cover all Czech letters (e.g., "ř"). Below, encode() raises a
UnicodeEncodeError
because of that.
"Dobrý den, přátelé!".encode("iso-8859-1")
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Cell In[56], line 1
----> 1 "Dobrý den, přátelé!".encode("iso-8859-1")

UnicodeEncodeError: 'latin-1' codec can't encode character '\u0159' in position 12: ordinal not in range(256)
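Like decode(), the encode() method accepts an optional errors argument. The sketch below tries two of its possible values on the greeting from above and contrasts them with an encoding that covers all characters involved.

```python
greeting = "Dobrý den, přátelé!"

# "replace" substitutes a question mark for every unmappable character.
replaced = greeting.encode("iso-8859-1", errors="replace")
print(replaced)

# "ignore" silently drops unmappable characters.
ignored = greeting.encode("iso-8859-1", errors="ignore")
print(ignored)

# UTF-8 covers all characters in the text, so nothing is lost.
print(greeting.encode("utf-8").decode("utf-8") == greeting)
```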
The open() function takes an optional
encoding
argument as well.
with open("umlauts.txt") as file:
    print("".join(file.readlines()))
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[57], line 2
      1 with open("umlauts.txt") as file:
----> 2     print("".join(file.readlines()))

File <frozen codecs>:322, in decode(self, input, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 9: invalid continuation byte
with open("umlauts.txt", encoding="iso-8859-1") as file:
    print("".join(file.readlines()))
Lerchen-Lärchen-Ähnlichkeiten fehlen. Dieses abzustreiten mag im Klang der Worte liegen. Merke, eine Lerch' kann fliegen, Lärchen nicht, was kaum verwundert, denn nicht eine unter hundert ist geflügelt. Auch im Singen sind die Bäume zu bezwingen. Die Bätrachtung sollte reichen, Rächtschreibfählern auszuweichen. Leicht gälingt's, zu unterscheiden, wär ist wär nun von dän beiden.
A best practice is to always specify the encoding
, especially on computers running on Windows (cf., the talk by Łukasz Langa in the Further Resources section at the end of this chapter).
Below is the first example involving open() one last time: It shows how all the contents of a text file should be read into one
str
object.
with open("lorem_ipsum.txt", encoding="utf-8") as file:
    content = "".join(file.readlines())
content
"Lorem Ipsum is simply dummy text of the printing and typesetting industry.\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s\nwhen an unknown printer took a galley of type and scrambled it to make a type\nspecimen book. It has survived not only five centuries but also the leap into\nelectronic typesetting, remaining essentially unchanged. It was popularised in\nthe 1960s with the release of Letraset sheets.\n"
print(content)
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s
when an unknown printer took a galley of type and scrambled it to make a type
specimen book. It has survived not only five centuries but also the leap into
electronic typesetting, remaining essentially unchanged. It was popularised in
the 1960s with the release of Letraset sheets.
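As a side note, a file object's read() method returns the same text in one step. The sketch below compares both ways on a throwaway file that it creates itself, so it does not depend on lorem_ipsum.txt; the file name sample.txt and the sample text are made up for illustration.

```python
import os
import tempfile

text = "Grüße\naus Köln\n"  # made-up sample text

# Create a throwaway file so that the sketch is self-contained.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")

    with open(path, mode="w", encoding="utf-8") as file:
        file.write(text)

    # Join the individual lines, as done above.
    with open(path, encoding="utf-8") as file:
        joined = "".join(file.readlines())

    # Read the entire file at once.
    with open(path, encoding="utf-8") as file:
        whole = file.read()

print(joined == whole == text)
```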