Notebook

From bits to characters¶

You might have heard that computers operate in binary, with 1s and 0s. But they are also filled with numbers, words, images, and sounds. This chapter will teach you the conventions that have evolved over many years for encoding complex data as sequences of bits, and decoding those bits back into complex data.

I have two reasons for starting here. The first is practical. Many problems you will encounter in working with text are caused by inconsistencies in how we encode and decode text data. Knowing how data is actually represented can help you recognize and solve problems that might otherwise seem bizarre and incomprehensible. The second reason is emotional. If you know down to the very bits how a system works, it will seem less mysterious to you. You will feel more confident and less willing to give up if something isn't working how you think it should.

After working through this chapter you should be able to

work with binary and hexadecimal representations of numbers
describe different ways to represent characters as numbers
explain the motivation for the Unicode standard
describe the relationship between Unicode and byte encodings such as UTF-8

Numbers¶

So what can we do with a sequence of bits, binary digits that can be 1 or 0? The simplest thing is to represent is whole numbers. With one bit we can represent two numbers, 0 and 1. With two bits we can represent four numbers, from 0 to 3, by declaring that one bit represents 1 and the second bit represents 2.

In Python we can write these numbers literally with the 0b prefix:

In [1]:

[0b00, 0b01, 0b10, 0b11]

Out[1]:

[0, 1, 2, 3]

Variables tend to be defined in terms of bytes (groups of eight bits) rather than bits to keep memory managment simpler. With one byte we can represent 256 numbers, from 0 to 255, by declaring that the leftmost bit represents 128 and the rightmost bit represents 1.

In [2]:

0b10000001

Out[2]:

We can view the binary representation of a number by formatting a string. Here I'm asking for a string that shows the binary digits of two numbers, padded out to 8 digits with zeros:

In [3]:

"{:08b} {:08b}".format(170, 15)

Out[3]:

'10101010 00001111'

We can also display numbers in base 16, where the letters a-f represent 10-15 in decimal. This conveniently maps each group of four bits to one base-16 digit:

In [4]:

"{:02x} {:02x}".format(170, 15)

Out[4]:

'aa 0f'

If you add more bytes you can represent larger numbers. A typical size for an integer is four bytes, with a maximum value of a little over 4 billion, but in most cases the actual maximum is half that because we want to represent both positive and negative numbers. To do that we reserve the leftmost bit to represent the sign. The sign bit can cause strange behavior in very rare circumstances: if you are counting more than two billion of something, it's possible your numbers could go negative! But you almost certainly will not have to think about this -- it just works.

What about numbers with decimals? The trick for floating point numbers is to use something like scientific notation. 1.5 is 15 * 10^-1, so we can represent it as two integers: 15 and -1. We can therefore take a sequence of bits and declare that some of them represent the number and the rest represent the exponent (in binary). Again, this just works. There's a standard that almost all computers follow.

Letters¶

Once we move from the clean, mathematical world of numbers to the messy, culturally situated world of characters, things get more difficult. There's a lot less agreement and standardization, which is what makes dealing with text difficult. But the basic idea is simple: since we know how to deal with numbers, we need to find a way to transform letters into numbers. This transformation is called encoding.

Since there are 26 letters in the English alphabet and they have a fixed, well-known order, we can encode each letter as a number. "A" is 1, "B" is 2, up to "Z" at 26. Converting to binary, "A" is 0b1, "B" is 0b10, and "Z" is 0b11010. Since 26 is between 15 (2⁴ - 1) and 31 (2⁵ - 1), we need five bits to represent all the letters.

It turns out that this five-bit binary representation of letters was invented by Francis Bacon four hundred years ago (thanks to Dennis Tenen for the tip). Bacon devised it for secretly sending messages hidden in text printed using two slightly different typefaces. Five-bit letter encoding is a good start, but AFEWTHINGSAREMISSING. To fully represent text we're going to need to go beyond letters to characters.

Characters¶

Text isn't just letters. We also need upper and lower case letters, punctuation, whitespace (spaces, tabs, newlines), and digits, at the very minimum. And here we start to run into a problem. The English alphabet has an order -- you can find its Latin, Etruscan, Greek, and Phoenecian antecedents scratched on pot sherds in pretty much the same order for the last 3000 years. The digits from 0 to 9 are also in a sensible order. But does "9" come before "A" or after "Z"? What about "A" and "a"? Or even worse, "-" and "?". At this point we don't have tradition to guide us. Initially, every computer vendor had its own way of mapping characters to numbers, making it hard to move files from one system to another. Someone needed to make a decision and everyone had to agree to it.

For the characters that typically occur in English, the standard that has emerged is called ASCII. Before ASCII, every computer vendor had its own standard, so files couldn't be transferred from one system to another. ASCII is a seven-bit standard, so it can map 128 characters to numbers from 0 to 127. Much like a house that has been added onto many times, ASCII still forms the basis of modern character encoding systems, including the one that is the default in Python 3. To go from single character strings to numbers in Python we use the ord function. So how does ASCII encode letters, numbers, and punctuation?

In [5]:

[ord('a'), ord("z"), ord('A'), ord('Z'), ord('7'), ord(' '), ord('-'), ord('?')]

Out[5]:

[97, 122, 65, 90, 55, 32, 45, 63]

This output tells us a few things. Lowercase letters come after uppercase letters: 'a' is 97, while 'A' is 65. (Note that it doesn't matter if I use single quotes ' or double quotes ".) Digits are before any letters, the space character comes before digits, and, indeed, '-' comes before '?'.

65 and 97 might seem a bit arbitrary, but there's a pattern. Look at the binary representation for lower and uppercase "z":

In [6]:

"{:08b} {:08b}".format(122, 90)

Out[6]:

'01111010 01011010'

Remember that the binary representation of 26 is 0b11010. Now look at the last five bits of both strings. We've just copied Bacon's five-bit representation! For uppercase letters we add the prefix 010, and for lowercase 011. That's all it is.

To go the other way and get the character representation of a number, we use the chr function.

In [7]:

[chr(97), chr(32), chr(65)]

Out[7]:

['a', ' ', 'A']

Encodings¶

ASCII is a good standard for English, but what if your name is Jürgen and you live in Århus? One option is to map non-ASCII characters to ASCII. You could decide to be Juergen and live in Aarhus. But writing systems carry a lot of cultural weight, and people have strong opinions about them. We needed to do better.

Remember that ASCII is a seven-bit system, and we're using 8-bit bytes. That means we can fit 128 more characters into a byte! That's enough to add most of the characters used in Western European languages. This extension of ASCII forms a new character encoding. The most commonly used Western European encoding is called Latin-1 or the bureaucratically named ISO 8859-1 standard.

In python we can create a string and ask for its representation in bytes using the encode function. We need to specify an encoding standard to create this mapping:

In [8]:

"être".encode("ISO-8859-1")

Out[8]:

b'\xeatre'

This expression returns a byte array. The way python displays byte arrays is confusing, and it's worth unpacking it.

The b prefix indicates that what follows is a sequence of bytes. The three bytes at the end map to ASCII letters "tre" and so Python helpfully displays them with ASCII. The first letter, the e with circumflex, is mapped by the Latin-1 standard to the number 234, which isn't a valid ASCII letter, so Python displays it as a two-digit base-16 number, 0xea or 234 (e is 14, a is 10, 234 = 14*16 + 10). If you go to the bottom of the Wikipedia page for Latin-1 you can see a table showing how this encoding maps numbers to characters. Look in the row for e and the column for a, and you should see ê.

The \x is a prefix indicating that the next two letters ea represent an 8-bit number. Python uses \x as a prefix within strings, and 0x as a prefix for numbers outside of strings.

To make this display a little easier, we can define a helper function that will print bytes as numbers.

In [18]:

def display_bytes(byte_array, format_string="{:d}"):
    output = [] # list to put string representations in
    for byte in byte_array:
        # convert each byte to a formatted string and save it
        output.append(format_string.format(byte))  
    print(" ".join(output))

# display each byte as a decimal number
display_bytes("être".encode("ISO-8859-1"))

234 116 114 101

In [17]:

# the same thing but with binary and hex
display_bytes("être".encode("ISO-8859-1"), "{:08b}")
display_bytes("être".encode("ISO-8859-1"), "{:x}")

11101010 01110100 01110010 01100101
ea 74 72 65

But what if you happen to live in Greece? You could do the same trick: use ASCII for the first 128 characters, and put Greek letters in the remaining slots. One encoding that does just that is ISO 8859-7. To the computer, a file is just a sequence of bytes. For a system to work, the program that reads the file has to know which encoding to use to read a file. And this is where almost all encoding problems occur: when the encoding we are using to convert numbers into characters is not the same as the encoding someone used to convert characters into numbers. Here's what happens when we read French as Greek:

In [9]:

# convert to bytes, then convert bytes to letters with the wrong encoding
"être".encode("ISO-8859-1").decode("ISO-8859-7")

Out[9]:

'κtre'

The three letters "tre" are, again, fine, because both encodings inherit the same ASCII 0-127 range. But the French e-circumflex, which was written out as 234, is now a Greek letter kappa, because that is how the Greek encoding interprets the number 234 (look at the e row and the a column in the Wikipedia table).

This system was, and remains, a disaster. The potential confusion between different writing systems is bad enough. But vendors also couldn't agree on standards: moving a file from a Mac to a Windows PC could ruin a file, or at least sprinkle it with bizarre symbols (MacRoman and Windows-1252 are close to Latin-1, but noticeably different). One of the worst culprits is advanced punctuation, such as "smart quotes". Since these were introduced relatively late after word processing programs became common they didn't have standard encodings. The switch to the Euro led to many new problems when a completely new symbo If you don't believe me that this got out of hand, look at the table of links to code pages at the bottom of the Wikipedia page for the ISO 8859-1 standard.

But more importantly the "staple your characters onto ASCII" trick only works well for writing systems of the Phoenecian alphabetic type. The syllabic and ideographic writing systems of South and East Asia have massive use but don't fit in 128 or even 256 character slots. Many, many other writing systems had no representation, and therefore no ability to join the digital world.

Unicode¶

The solution to these problems was to create a single mapping from all known characters to numbers, called Unicode. Unicode is in fact the default in Python 3, which is one of the main reasons you should be using it over Python 2.

In [10]:

ord("你")

Out[10]:

Unicode combines a large number of legacy encodings as 128-character "codepages". For example, the first codepage is ASCII. The second is the "upper half" of Latin-1 (characters 128-255), followed by the upper half of the ISO Greek, Arabic, Cyrillic, and Hebrew encodings. Although the Unicode consortium continues to include existing writing systems, some of the largest expansion has gone into emojis, which have standard Unicode values.

In [11]:

ord("🤯")

Out[11]:

Having a standard mapping from characters to numbers is great, but creates a new problem. Almost all commonly used writing systems can be expressed with fewer than 65000 characters, which can be represented using two bytes (2¹⁶ = 65536). But the exploding head emoji is number 129327. Do we need to reserve three or even four bytes for every character? This problem leads to the last major difficulty in modern text encoding.

Unicode encodings (UTF-8)¶

You've probably seen the UTF-8 encoding when saving or loading files, and you might know that it has something to do with Unicode. It's really just a way of writing numbers using the fewest possible bits. We write smaller numbers with just one byte, and larger numbers with up to four.

So far we have assumed that the number we associate with a character and the sequence of bits we save to a file are the same. ASCII defines 'a' as 97, and so when we want to write that character to an ASCII file we write the byte 97. A simple way to write Unicode characters to a file is called UTF-16. For each character we look up its numeric value and then write two bytes that represent that number:

In [12]:

["a".encode("UTF-16"), "你".encode("UTF-16")]

Out[12]:

[b'\xff\xfea\x00', b'\xff\xfe`O']

In [10]:

2**16

Out[10]:

Both byte sequences have a prefix \xff\xfe called a byte-order mark (BOM). The BOM indicates which order we should read the two bytes for each character. In this case \xff\xfe means we should treat the first byte of each pair as the low byte, and the second as the high byte.

For the a, the encoded byte sequence is 97 (hex 0x61, ASCII a) followed by \x00, or the 0 byte. Using the order specified by the BOM, these bytes are equivalent to 0x0061. For the Chinese character whose value is 20320 = 79*256 + 96, the two bytes are the backtick ` (ASCII 96) and O (ASCII 79). Again, the BOM tells us to reverse the order of the bytes when we stitch them together into a single number.

UTF-16 isn't great. If text is mostly in the ASCII range we're wasting half our bits storing zeros. There's also no way to write the emoji since we can only go up to 2¹⁶ = 65536. There is a UTF-32 standard that uses four bytes, but unless our text is almost entirely emoji we would be wasting at least half our bytes for most characters.

But the number assigned to the character and the bytes we write to disk don't have to be the same. Unicode is really just a way of giving each character a number, but there's no reason we need to write that number out completely. To distinguish the numeric Unicode reprsentation of the character from the bytes we write to disk, we use the term "codepoint" to mean that abstract numeric Unicode value. The UTF-8 encoding writes different numbers of bytes for each character depending on its Unicode codepoint value.

UTF-8 uses one to four bytes to represent each codepoint, depending on its magnitude. It treats the highest bit in a byte as a "continuation" indicator. If that bit is zero, meaning that the number represented by that byte is less than 128, that byte represents a single character, and all additional bits are interpreted as the number for that character.

For a Unicode codepoint below 128, like a, we write one byte.

In [19]:

display_bytes("a".encode("UTF-8"), "{:08b}")

01100001

Now look at the UTF-8 representation for á, which is Unicode codepoint 225.

In [20]:

display_bytes("á".encode("UTF-8"), "{:08b}")

11000011 10100001

Decimal 225 is 0b11100001 in binary. It's written out as two bytes in UTF-8. Both bytes start with 1. The first starts with 110, which indicates a two-byte character (1110 indicates three bytes, 11110 indicates four). The second starts with 10, which indicates that this byte is a continuation byte. If we drop those two prefixes and combine the remaining bits, we are left with 0b00011100001. With some zeros at the start, this is exactly what we expected! See the Wikipedia UTF-8 page for more details of the encoding.

The result is that characters with higher Unicode codepoints will require more bytes. Thus an ASCII a comes out as one byte, an accented Latin character has two, a Chinese character has three, and an emoji has four.

In [23]:

display_bytes("a".encode("UTF-8"))
display_bytes("á".encode("UTF-8"))
display_bytes("你".encode("UTF-8"))
display_bytes("🤯".encode("UTF-8"))

97
195 161
228 189 160
240 159 164 175

The fact that codepoints below 128 are written in one byte means that text in the original ASCII 0-127 range can be read as valid UTF-8.

In [13]:

## encode the string as ASCII, then interpret it as UTF-8
"This, indeed, is ASCII!".encode("ascii").decode("UTF-8")

Out[13]:

'This, indeed, is ASCII!'

In [1]:

## encode the string as UTF-8, then interpret it as ASCII
"This, indeed, is ASCII!".encode("UTF-8").decode("ascii")

Out[1]:

'This, indeed, is ASCII!'

We cannot swap ASCII and UTF-8 for these characters!

In [7]:

try:
    "你".encode("ascii")
except UnicodeEncodeError as err:
    print(err)

'ascii' codec can't encode character '\u4f60' in position 0: ordinal not in range(128)

In [9]:

## The first byte of the UTF-8 encoding is e4, which is > 128, and thus not valid ASCII.

try:
    "你".encode("UTF-8").decode("ascii")
except UnicodeDecodeError as err:
    print(err)

'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

One of the most common problems we encounter is when text has been encoded in UTF-8 but then decoded in a one-byte legacy encoding. Seeing examples of this can help us recognize problems when we see them in text. Consider this phrase from the French Wikipedia:

In [25]:

"Alexandria Ocasio-Cortez avait déjà un astéroïde à son nom".encode("UTF-8").decode("Latin-1")

Out[25]:

'Alexandria Ocasio-Cortez avait dÃ©jÃ\xa0 un astÃ©roÃ¯de Ã\xa0 son nom'

Each of the non-ASCII characters has been replaced by two strange-looking characters. This "doubling" effect is a good indication of misread UTF-8. The other giveaway is that each of those letter pairs has the same first letter: Ã. Remember that in the á example the UTF-8 encoding produced two bytes, where the first byte included the high bits of the binary representation of the codepoint 225. Characters that are similar to each other, such as Latin characters with French accents, will have similar codepoints, which will have similar high bits. Here's another example in Russian (from the nice looking UTF-8 page). Note that the repeated letter is now Ð.

In [26]:

"Алгоритм кодирования".encode("UTF-8").decode("Latin-1")

Out[26]:

'Ð\x90Ð»Ð³Ð¾Ñ\x80Ð¸Ñ\x82Ð¼ ÐºÐ¾Ð´Ð¸Ñ\x80Ð¾Ð²Ð°Ð½Ð¸Ñ\x8f'

Glyphs and fonts¶

The last step in this process is the conversion of encodings of letters into small images that appear on a screen: the glyphs. With a pen and paper (or brush and papyrus), the connection between the idea of a letter and a visual representation is direct and immediate. I think of "a" and I draw a glyph that I hope my readers will interpret as an "a". Medieval manuscripts are difficult to read because scribes write glyphs in many different ways, including ligatures that represent more than one letter. Glyphs can also represent multiple letters. Many old typewriters did not have a 1 key, for example, because the glyph for the number "1" and the letter "l" were the same. It didn't matter that they were conceptually different, because the image struck on the paper would be the same.

In computers, the glyph ultimately seen by a user is determined by the font, which is a mapping from codepoints to images. If a font does not have a glyph for a codepoint, it may display a question mark or an empty box. It may also substitute a glyph from another font, which may cause accented characters to be slightly larger or smaller than unaccented characters.

In [33]:

print("\U00012010")

𒀐