Note: Click on "Kernel" > "Restart Kernel and Clear All Outputs" in JupyterLab before reading this notebook to reset its output. If you cannot run this file on your machine, you may want to open it in the cloud.

Chapter 6: Text & Bytes (continued)¶

In this second part of the chapter, we look in more detail at how str objects work in memory, in particular how the $0$s and $1$s in the memory translate into characters.

Special Characters¶

As previously seen, some characters have a special meaning when following the escape character "\". Besides escaping the kind of quote used as the str object's delimiter, ' or ", most of these escape sequences (i.e., "\" with the subsequent character) act as control characters that move the "cursor" in the output without generating any pixel on the screen. Because of that, we only see the effect of such escape sequences when used with the print() function. The documentation lists all available escape sequences, of which we show the most important ones below.

The most common escape sequence is "\n", which "prints" a newline character, also called the line feed character or LF for short.

In [1]:

"This is a sentence\nthat is printed\non three lines."

Out[1]:

'This is a sentence\nthat is printed\non three lines.'

In [2]:

print("This is a sentence\nthat is printed\non three lines.")

This is a sentence
that is printed
on three lines.

"\b" is the backspace character, or BS for short, which moves the cursor back by one character so that the next character overwrites the previous one.

In [3]:

print("ABC\bX")

ABX

In [4]:

print("ABC\bXY")

ABXY

Similarly, "\r" is the carriage return character, or CR for short, which moves the cursor back to the beginning of the line.

In [5]:

print("ABC\rX")

XBC

In [6]:

print("ABC\rXY")

XYC

While Linux and modern macOS systems use solely "\n" to express a new line, Windows systems default to using "\r\n". This may lead to "weird" bugs in software projects where people using both kinds of operating systems collaborate.

In [7]:

print("This is a sentence\r\nthat is printed\r\non three lines.")

This is a sentence
that is printed
on three lines.
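
When text may come from either kind of system, one defensive option is the str type's splitlines() method, which treats "\n" and "\r\n" alike. The following is only a minimal sketch; the variable names and the sample texts are made up for illustration.

```python
# Hypothetical sample texts: same content, different line endings.
unix_text = "first\nsecond\nthird"
windows_text = "first\r\nsecond\r\nthird"

# Splitting on "\n" alone leaves a stray "\r" at the end of most lines.
naive_lines = windows_text.split("\n")
print(naive_lines)

# splitlines() recognizes both conventions and drops the line endings.
robust_lines = windows_text.splitlines()
print(robust_lines)
```

With splitlines(), both sample texts yield the same list of three lines.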

"\t" makes the cursor "jump" to the next of several equidistant tab stops. That may be useful for formatting the output of a program with lengthy and tabular results.

In [8]:

print("Jump\tfrom\ttab\tstop\tto\ttab\tstop.\nThe\tsecond\tline\tdoes\tso\ttoo.")

Jump	from	tab	stop	to	tab	stop.
The	second	line	does	so	too.

Raw Strings¶

Sometimes we do not want the backslash "\" and its subsequent character to be interpreted as an escape sequence. For example, let's print a typical installation path on a Windows system. Obviously, the newline character "\n" does not make sense here.

In [9]:

print("C:\Programs\new_application")

C:\Programs
ew_application

<>:1: SyntaxWarning: invalid escape sequence '\P'
<>:1: SyntaxWarning: invalid escape sequence '\P'
/tmp/ipykernel_159416/1102122489.py:1: SyntaxWarning: invalid escape sequence '\P'
  print("C:\Programs\new_application")

Some str objects even produce a SyntaxError: Below, "\U" starts a Unicode escape sequence, but the characters following it cannot be interpreted as a Unicode code point (cf., next section).

In [10]:

print("C:\Users\Administrator\Desktop\Project")

  Cell In[10], line 1
    print("C:\Users\Administrator\Desktop\Project")
          ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

A simple solution would be to escape the escape character with a second backslash "\".

In [11]:

print("C:\\Programs\\new_application")

C:\Programs\new_application

In [12]:

print("C:\\Users\\Administrator\\Desktop\\Project")

C:\Users\Administrator\Desktop\Project

However, this is tedious to remember and type. For such use cases, Python allows us to prefix any string literal with an r. The literal is then interpreted in a "raw" way.

In [13]:

print(r"C:\Programs\new_application")

C:\Programs\new_application

In [14]:

print(r"C:\Users\Administrator\Desktop\Project")

C:\Users\Administrator\Desktop\Project
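
To convince ourselves that the r prefix only changes how the literal is read and not what kind of object is created, we may compare both spellings directly. This is a small sketch; the names escaped and raw are chosen for illustration.

```python
escaped = "C:\\Programs\\new_application"
raw = r"C:\Programs\new_application"

# Both literals create str objects with identical contents.
print(escaped == raw)

# "\n" is one newline character, whereas r"\n" consists of
# two characters: a backslash followed by the letter "n".
print(len("\n"), len(r"\n"))
```

One caveat is that a raw string literal cannot end in a single backslash, as that backslash would still prevent the closing quote from ending the literal.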

Characters are Numbers with a Convention¶

So far, we used the term character without any further consideration. In this section, we briefly look into what characters are and how they are modeled in software.

Chapter 5 gives us an idea of how individual bits are used to express all types of numbers, from "simple" int objects to "complex" float ones. To model characters, another layer of abstraction is put on top of whole numbers. So, just as bits are used to express integers, the integers themselves are used to express characters.

ASCII¶

Many conventions have been developed as to what integer is associated with which character. The most basic one that was also adopted around the world is the so-called American Standard Code for Information Interchange, or ASCII for short. It uses 7 bits of information to map the unprintable control characters as well as the printable letters of the alphabet, numbers, and common symbols to the numbers 0 through 127.

A mapping from characters to numbers is referred to by the technical term encoding. We may use the built-in ord() function to encode any single character. The inverse to that is the built-in chr() function, which decodes a number into a character.

In [15]:

ord("A")

Out[15]:

65

In [16]:

chr(65)

Out[16]:

'A'

Of course, unprintable escape sequences like "\n" count as only one character.

In [17]:

ord("\n")

Out[17]:

10

In [18]:

chr(10)

Out[18]:

'\n'

In ASCII, the numbers 0 through 31 (and 127) are mapped to all kinds of unprintable control characters. The decimal digits are encoded with the numbers 48 through 57, the upper case letters with 65 through 90, and the lower case letters with 97 through 122. While this seems random at first, there is of course a "sophisticated" system behind it. That can immediately be seen when looking at the encoded numbers in their binary representations.

For example, the digit 5 is mapped to the number 53 in ASCII. The binary representation of 53 is 0b_11_0101 and the four least significant bits, 0101, mean $5$. Similarly, the letter "E" is the fifth letter in the alphabet. It is encoded with the number 69 in ASCII, which is 0b_100_0101 in binary. And, the five least significant bits, 0_0101, mean $5$. Analogously, "e" is encoded with 101 in ASCII, which is 0b_110_0101 in binary. And, the five least significant bits, 0_0101, mean $5$ again. This encoding was chosen mainly because programmers "in the old days" needed to implement these encodings "by hand." Python abstracts that logic away from its users.
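
The bit patterns described here can be checked with the bitwise operators & and ^. The snippet is merely a sketch of the kind of logic programmers once wrote "by hand"; all variable names are made up for illustration.

```python
# The four least significant bits of an encoded digit are the digit's value.
digit_value = ord("5") & 0b1111
print(digit_value)

# The five least significant bits of a letter give its position in the alphabet.
upper_position = ord("E") & 0b1_1111
lower_position = ord("e") & 0b1_1111
print(upper_position, lower_position)

# Upper and lower case letters differ only in the bit that is worth 32.
flipped = chr(ord("E") ^ 0b10_0000)
print(flipped)
```

So, converting between upper and lower case in ASCII comes down to flipping a single bit.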

This encoding scheme is also the cause for the "weird" sorting in the "String Comparison" section in the first part of this chapter, where "apple" comes after "Banana". As "a" is encoded with 97 and "B" with 66, "Banana" must of course be "smaller" than "apple" when the two are compared character by character.
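
The following sketch reproduces that ordering and shows one common way around it, namely passing a key function to sorted(). The list fruits is a made-up example.

```python
fruits = ["cherry", "apple", "Banana"]

# By default, str objects are compared by the numbers behind their characters.
by_code_point = sorted(fruits)
print(by_code_point)

# Comparing lower-cased copies instead ignores the upper/lower case distinction.
ignoring_case = sorted(fruits, key=str.lower)
print(ignoring_case)
```

The original str objects remain unchanged; the key function only affects the comparison.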

In [19]:

for number in range(48, 58):
    print(number, bin(number), "-> ", chr(number))

48 0b110000 ->  0
49 0b110001 ->  1
50 0b110010 ->  2
51 0b110011 ->  3
52 0b110100 ->  4
53 0b110101 ->  5
54 0b110110 ->  6
55 0b110111 ->  7
56 0b111000 ->  8
57 0b111001 ->  9

In [20]:

for i, number in enumerate(range(65, 91), start=1):
    end = "\n" if i % 3 == 0 else "\t"
    print(number, bin(number), "-> ", chr(number), end=end)

65 0b1000001 ->  A	66 0b1000010 ->  B	67 0b1000011 ->  C
68 0b1000100 ->  D	69 0b1000101 ->  E	70 0b1000110 ->  F
71 0b1000111 ->  G	72 0b1001000 ->  H	73 0b1001001 ->  I
74 0b1001010 ->  J	75 0b1001011 ->  K	76 0b1001100 ->  L
77 0b1001101 ->  M	78 0b1001110 ->  N	79 0b1001111 ->  O
80 0b1010000 ->  P	81 0b1010001 ->  Q	82 0b1010010 ->  R
83 0b1010011 ->  S	84 0b1010100 ->  T	85 0b1010101 ->  U
86 0b1010110 ->  V	87 0b1010111 ->  W	88 0b1011000 ->  X
89 0b1011001 ->  Y	90 0b1011010 ->  Z

In [21]:

for i, number in enumerate(range(97, 123), start=1):
    end = "\n" if i % 3 == 0 else "\t"
    print(str(number).rjust(3), bin(number), "-> ", chr(number), end=end)

 97 0b1100001 ->  a	 98 0b1100010 ->  b	 99 0b1100011 ->  c
100 0b1100100 ->  d	101 0b1100101 ->  e	102 0b1100110 ->  f
103 0b1100111 ->  g	104 0b1101000 ->  h	105 0b1101001 ->  i
106 0b1101010 ->  j	107 0b1101011 ->  k	108 0b1101100 ->  l
109 0b1101101 ->  m	110 0b1101110 ->  n	111 0b1101111 ->  o
112 0b1110000 ->  p	113 0b1110001 ->  q	114 0b1110010 ->  r
115 0b1110011 ->  s	116 0b1110100 ->  t	117 0b1110101 ->  u
118 0b1110110 ->  v	119 0b1110111 ->  w	120 0b1111000 ->  x
121 0b1111001 ->  y	122 0b1111010 ->  z

The remaining symbols in ASCII are mapped to the numbers still unused, which is why they are scattered across several ranges.

In [22]:

symbols = (
    list(range(32, 48))
    + list(range(58, 65))
    + list(range(91, 97))
    + list(range(123, 127))
)

In [23]:

for i, number in enumerate(symbols, start=1):
    end = "\n" if i % 3 == 0 else "\t"
    print(str(number).rjust(3), bin(number).rjust(10), "-> ", chr(number), end=end)

 32   0b100000 ->   	 33   0b100001 ->  !	 34   0b100010 ->  "
 35   0b100011 ->  #	 36   0b100100 ->  $	 37   0b100101 ->  %
 38   0b100110 ->  &	 39   0b100111 ->  '	 40   0b101000 ->  (
 41   0b101001 ->  )	 42   0b101010 ->  *	 43   0b101011 ->  +
 44   0b101100 ->  ,	 45   0b101101 ->  -	 46   0b101110 ->  .
 47   0b101111 ->  /	 58   0b111010 ->  :	 59   0b111011 ->  ;
 60   0b111100 ->  <	 61   0b111101 ->  =	 62   0b111110 ->  >
 63   0b111111 ->  ?	 64  0b1000000 ->  @	 91  0b1011011 ->  [
 92  0b1011100 ->  \	 93  0b1011101 ->  ]	 94  0b1011110 ->  ^
 95  0b1011111 ->  _	 96  0b1100000 ->  `	123  0b1111011 ->  {
124  0b1111100 ->  |	125  0b1111101 ->  }	126  0b1111110 ->  ~

As the ASCII character set does not work for many languages other than English, various other encodings were developed. Popular examples are ISO 8859-1 for western European languages or Windows-1250 for central and eastern European languages written in Latin script. Many of these encodings use 8-bit numbers (i.e., 0 through 255) to map the multitude of non-English letters (e.g., the German letters "ä", "ö", "ü", and "ß").

Unicode¶

However, none of these specialized encodings can map all characters of all languages around the world from all times in human history. To achieve that, a truly global standard called Unicode was developed, and its first version was released in 1991. Since then, Unicode has been amended with many other "characters," the most popular among them being emojis. Even the Klingon script (from the science fiction series Star Trek) was proposed for inclusion, although it never became part of the official standard. In Unicode, every character is given an identity referred to as the code point. Code points are numbers, usually written in hexadecimal notation, from 0x0000 through 0x10ffff, or U+0000 through U+10FFFF outside of Python. Consequently, there exist at most $1,114,112$ code points, of which only about 10% are currently in use, allowing lots of room for new characters to be invented. The first 128 code points (i.e., 0 through 127) are identical to the ASCII encoding for reasons explained in the "The bytes Type" section further below. There exist plenty of lists of all Unicode characters on the web (e.g., Wikipedia).

All we need to know to print a character is its code point. Python uses the escape sequence "\U", which must be followed by exactly eight hexadecimal digits. Underscore separators are unfortunately not allowed here.

So, to print a smiley, we just need to look up the corresponding code point in such a list.

In [24]:

"\U0001f604"

Out[24]:

'😄'

Every Unicode character also has a descriptive name that we can use with the escape sequence "\N" and within curly braces {}.

In [25]:

"\N{FACE WITH TEARS OF JOY}"

Out[25]:

'😂'
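
The names and code points can also be looked up programmatically with the unicodedata module in the standard library. This is a short sketch; the variable names are chosen for illustration.

```python
import unicodedata

face = "\N{FACE WITH TEARS OF JOY}"

# ord() gives the code point, which we show in hexadecimal notation.
code_point = hex(ord(face))
print(code_point)

# name() maps a character to its descriptive name ...
name = unicodedata.name(face)
print(name)

# ... and lookup() goes the opposite direction.
snake = unicodedata.lookup("SNAKE")
print(snake)
```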

Whenever the code point can be expressed with just four hexadecimal digits, we may use the escape sequence "\u" for brevity.

In [26]:

"\U00000041"  # hex(65) == 0x41

Out[26]:

'A'

In [27]:

"\u0041"

Out[27]:

'A'

Analogously, if the code point can be expressed with two hexadecimal digits, we may use the escape sequence "\x" for even more concise code.

In [28]:

"\x41"

Out[28]:

'A'
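
All of these notations are just different spellings of the same character in source code. As a quick check, the sketch below collects them in a list (the name notations is made up) and compares each entry to "A".

```python
notations = [
    "A",
    "\x41",                        # two hexadecimal digits
    "\u0041",                      # four hexadecimal digits
    "\U00000041",                  # eight hexadecimal digits
    "\N{LATIN CAPITAL LETTER A}",  # the descriptive name
    chr(0x41),                     # decoding the number at runtime
]

all_equal = all(notation == "A" for notation in notations)
print(all_equal)
```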

As the str type is based on Unicode, a str object's behavior is more in line with how humans view text than with how it is expressed in source code.

For example, while it is obvious that len("A") evaluates to 1, ...

In [29]:

len("A")

Out[29]:

1

... what should len("\N{SNAKE}") evaluate to? As the idea of a snake is expressed as one "character," len() also returns 1 here.

In [30]:

"\N{SNAKE}"

Out[30]:

'🐍'

In [31]:

len("\N{SNAKE}")

Out[31]:

1

Many of the built-in str methods also consider Unicode. For example, in contrast to lower(), the casefold() method knows that the German "ß" is commonly converted to "ss". So, when comparing text in a case-insensitive way, normalizing it with casefold() may yield better results than with lower().

In [32]:

"Straße".lower()

Out[32]:

'straße'

In [33]:

"Straße".casefold()

Out[33]:

'strasse'
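
The difference matters as soon as two spellings of the same word are compared. In the sketch below, the names left and right are made up for illustration.

```python
left, right = "Straße", "STRASSE"

# lower() keeps the "ß", so the two spellings do not match.
match_with_lower = left.lower() == right.lower()
print(match_with_lower)

# casefold() converts the "ß" into "ss", so they do.
match_with_casefold = left.casefold() == right.casefold()
print(match_with_casefold)
```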

Many other methods like isdecimal(), isdigit(), isnumeric(), isprintable(), isidentifier(), and many more may be worthwhile to know for the data science practitioner, especially when it comes to data cleaning.
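
The first three of these methods sound alike but differ in how broadly they define a "number." The sketch below applies them to a few hand-picked sample characters; the names samples and results are made up for illustration.

```python
samples = ["5", "\N{SUPERSCRIPT TWO}", "\N{VULGAR FRACTION ONE HALF}", "five"]

results = {}
for sample in samples:
    # Collect the answers of the three methods, from strictest to broadest.
    results[sample] = (sample.isdecimal(), sample.isdigit(), sample.isnumeric())
    print(repr(sample), *results[sample])
```

Only isdecimal() guarantees that the text consists of characters usable in base-10 numbers, which is why it is often a reasonable first check when cleaning numeric columns.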

Multi-line Strings¶

Sometimes, it is convenient to split text across multiple lines in source code, for example, to make lines fit into the 79 characters requirement of PEP 8 or because the text consists of many lines and typing out "\n" is tedious. However, enclosing multiple lines in ordinary double quotes " results in a SyntaxError.

In [34]:

"
Do not break the lines like this
"

  Cell In[34], line 1
    "
    ^
SyntaxError: unterminated string literal (detected at line 1)

Instead, we may enclose a string literal with either triple double quotes """ or triple single quotes '''. Then, newline characters in the source code are converted into "\n" characters in the resulting str object. Docstrings are precisely that, and, by convention, always written within triple double quotes """.

In [35]:

multi_line = """
I am a multi-line string
consisting of four lines.
"""

A caveat is that "\n" characters are often inserted at the beginning or end of the text when we try to format the source code nicely.

In [36]:

multi_line

Out[36]:

'\nI am a multi-line string\nconsisting of four lines.\n'

In [37]:

print(multi_line)

I am a multi-line string
consisting of four lines.

Using the split() method with the optional sep argument, we confirm that multi_line consists of four lines with the first and last line being empty.

In [38]:

for i, line in enumerate(multi_line.split("\n"), start=1):
    print(i, line)

1 
2 I am a multi-line string
3 consisting of four lines.
4

To mitigate that, we often see the strip() method in source code.

In [39]:

multi_line = """
I am a multi-line string
consisting of two lines.
""".strip()

In [40]:

for i, line in enumerate(multi_line.split("\n"), start=1):
    print(i, line)

1 I am a multi-line string
2 consisting of two lines.

The `bytes` Type¶

To end this chapter, we want to briefly look at the bytes data type, which conceptually is a sequence of bytes. That data format is probably one of the most generic ways of exchanging data between any two programs or computers (e.g., a web browser obtains its data from a web server in this format).

Let's open a binary file in read-only mode (i.e., mode="rb") and read in all of its contents.

In [41]:

with open("full_house.bin", mode="rb") as binary_file:
    data = binary_file.read()

data is an object of type bytes.

In [42]:

id(data)

Out[42]:

139880714782512

In [43]:

type(data)

Out[43]:

bytes

Its value is shown in the literal bytes notation with a b prefix (cf., the reference). Every byte that does not correspond to a printable ASCII character is expressed in hexadecimal representation with the escape sequence "\x". This representation is commonly chosen as we cannot tell what kind of information is hidden in the data by just looking at the bytes. Instead, we must be told by some other source how to decode the raw bytes into information we can interpret.

In [44]:

data

Out[44]:

b'\xf0\x9f\x82\xa7\xf0\x9f\x82\xb7\xf0\x9f\x83\x97\xf0\x9f\x83\x8e\xf0\x9f\x83\x9e'

bytes objects work like str objects in many ways. In particular, they are sequences as well: The number of bytes is finite and we may iterate over them in order.

In [45]:

len(data)

Out[45]:

20

Consisting of 8 bits, a single byte can always be interpreted as a whole number between 0 and 255. That is exactly what we see when we loop over the data ...

In [46]:

for byte in data:
    print(byte, end=" ")

240 159 130 167 240 159 130 183 240 159 131 151 240 159 131 142 240 159 131 158

... or index into them.

In [47]:

data[-1]

Out[47]:

158

Slicing returns another bytes object.

In [48]:

data[::2]

Out[48]:

b'\xf0\x82\xf0\x82\xf0\x83\xf0\x83\xf0\x83'
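
As the binary file may not be at hand, the sketch below works with a bytes literal holding only the first four bytes of data. It illustrates how bytes objects, integers, and the hexadecimal notation relate to each other; all names are chosen for illustration.

```python
first_card = b"\xf0\x9f\x82\xa7"  # the first four bytes of data above

# Converting to a list reveals the integers behind the bytes ...
as_integers = list(first_card)
print(as_integers)

# ... and bytes() creates a bytes object from such integers again.
rebuilt = bytes(as_integers)
print(rebuilt == first_card)

# hex() shows the same information without the "\x" escape sequences.
as_hex = first_card.hex()
print(as_hex)

# Indexing gives an int object, whereas slicing gives a bytes object.
single = first_card[0]
sliced = first_card[0:1]
print(type(single), type(sliced))
```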

Character Encodings¶

Luckily, data consists of bytes encoded with the UTF-8 encoding. That is the most common way of mapping a Unicode character's code point to a sequence of bytes.

To obtain a str object out of a given bytes object, we decode it with the bytes type's decode() method.

In [49]:

cards = data.decode()

In [50]:

type(cards)

Out[50]:

str

So, data consisted of a full house hand in a poker game.

In [51]:

cards

Out[51]:

'🂧🂷🃗🃎🃞'

To go the opposite direction and encode a given str object, we use the str type's encode() method.

In [52]:

place = "Café Kastanientörtchen"

In [53]:

place.encode()

Out[53]:

b'Caf\xc3\xa9 Kastanient\xc3\xb6rtchen'
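
In the output above, the plain letters stay readable while "é" and "ö" take up two bytes each. That is because UTF-8 is a variable-length encoding: The higher the code point, the more bytes are needed, and ASCII characters are encoded with the very same single byte as in ASCII. The sketch below makes that visible for four hand-picked characters; the names are made up for illustration.

```python
characters = ["A", "é", "€", "🐍"]

lengths = []
for character in characters:
    encoded = character.encode("utf-8")
    lengths.append(len(encoded))
    # Show the character, its code point, and its UTF-8 bytes.
    print(character, hex(ord(character)), len(encoded), encoded)

# For ASCII characters, the UTF-8 and ASCII encodings coincide.
same_as_ascii = "A".encode("utf-8") == "A".encode("ascii")
print(same_as_ascii)
```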

By default, encode() and decode() use an encoding="utf-8" argument. We may use another encoding like, for example, "iso-8859-1", which can deal with ASCII and western European letters.

In [54]:

place.encode("iso-8859-1")

Out[54]:

b'Caf\xe9 Kastanient\xf6rtchen'

However, we must use the same encoding for the decoding step as for the encoding step. Otherwise, a UnicodeDecodeError may be raised or, worse, the bytes are silently decoded into the wrong characters.

In [55]:

place.encode("iso-8859-1").decode()

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[55], line 1
----> 1 place.encode("iso-8859-1").decode()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte
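
A mismatch between the two steps does not always raise an error. In the opposite direction, every byte is valid in ISO 8859-1, so decoding succeeds but produces garbled text. The sketch below shows that case together with the optional errors argument of decode(); the variable names are made up for illustration.

```python
word = "Café"
utf8_bytes = word.encode("utf-8")
latin1_bytes = word.encode("iso-8859-1")

# Wrong codec without an error: The two bytes of "é" become two characters.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)

# Wrong codec with errors="replace": The undecodable byte is
# substituted with the replacement character U+FFFD instead.
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced)
```

Neither result recovers the original text, which is why knowing the correct encoding remains necessary.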

Not all encodings map all Unicode code points. For example, "iso-8859-1" does not cover all Czech letters, such as "ř". Below, encode() raises a UnicodeEncodeError because of that.

In [56]:

"Dobrý den, přátelé!".encode("iso-8859-1")

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Cell In[56], line 1
----> 1 "Dobrý den, přátelé!".encode("iso-8859-1")

UnicodeEncodeError: 'latin-1' codec can't encode character '\u0159' in position 12: ordinal not in range(256)
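
If an error is not acceptable, there are at least two ways out, sketched below with made-up variable names: We may pass errors="replace" to encode() and accept the loss of information, or we may choose an encoding that covers the letters in question, such as "iso-8859-2" for Czech.

```python
greeting = "Dobrý den, přátelé!"

# errors="replace" substitutes a "?" for every character that cannot be encoded.
lossy = greeting.encode("iso-8859-1", errors="replace")
print(lossy)

# "iso-8859-2" knows all the letters, so the round trip is lossless.
round_trip = greeting.encode("iso-8859-2").decode("iso-8859-2")
print(round_trip)
```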

Reading Files (continued)¶

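The open() function takes an optional encoding argument as well. If it is omitted, open() falls back to a platform-dependent default. Below, that default is UTF-8, which fails because the file is stored in the ISO 8859-1 encoding.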

In [57]:

with open("umlauts.txt") as file:
    print("".join(file.readlines()))

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[57], line 2
      1 with open("umlauts.txt") as file:
----> 2     print("".join(file.readlines()))

File <frozen codecs>:322, in decode(self, input, final)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 9: invalid continuation byte

In [58]:

with open("umlauts.txt", encoding="iso-8859-1") as file:
    print("".join(file.readlines()))

Lerchen-Lärchen-Ähnlichkeiten
fehlen. Dieses abzustreiten
mag im Klang der Worte liegen.
Merke, eine Lerch' kann fliegen,
Lärchen nicht, was kaum verwundert,
denn nicht eine unter hundert
ist geflügelt. Auch im Singen
sind die Bäume zu bezwingen.
Die Bätrachtung sollte reichen,
Rächtschreibfählern auszuweichen.
Leicht gälingt's, zu unterscheiden,
wär ist wär nun von dän beiden.

Best Practice: Use UTF-8 explicitly¶

A best practice is to always specify the encoding, especially on computers running on Windows (cf., the talk by Łukasz Langa in the Further Resources section at the end of this chapter).

Below is the first example involving open() one last time: It shows how all the contents of a text file should be read into one str object.

In [59]:

with open("lorem_ipsum.txt", encoding="utf-8") as file:
    content = "".join(file.readlines())

In [60]:

content

Out[60]:

"Lorem Ipsum is simply dummy text of the printing and typesetting industry.\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s\nwhen an unknown printer took a galley of type and scrambled it to make a type\nspecimen book. It has survived not only five centuries but also the leap into\nelectronic typesetting, remaining essentially unchanged. It was popularised in\nthe 1960s with the release of Letraset sheets.\n"

In [61]:

print(content)

Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s
when an unknown printer took a galley of type and scrambled it to make a type
specimen book. It has survived not only five centuries but also the leap into
electronic typesetting, remaining essentially unchanged. It was popularised in
the 1960s with the release of Letraset sheets.
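
As an aside, joining the result of readlines() is not the only way to obtain one str object: The file object's read() method returns the same text in one step. The sketch below is self-contained in that it first writes a small temporary file; the file name example.txt and its contents are made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "example.txt"  # hypothetical file name

    # Write a small text file with an explicit encoding.
    with open(path, mode="w", encoding="utf-8") as file:
        file.write("Lärchen\nLerchen\n")

    # Read it back in the way shown above ...
    with open(path, encoding="utf-8") as file:
        joined = "".join(file.readlines())

    # ... and with the read() method.
    with open(path, encoding="utf-8") as file:
        content = file.read()

print(joined == content)
```

In both cases, the encoding is specified explicitly for writing as well as for reading.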