This topic can be confusing because Unicode, UTF-8, hexadecimal and binary tend to get mixed together. To untangle them, I am going to start with this:
If English were the only language on the planet, we wouldn't need Unicode or UTF-8. ASCII would be enough.
However, since that is not the case, we need to go way beyond the 128 symbols of the ASCII table (and beyond the 256 values a single byte can hold) to cover everything. This bigger table that holds almost everything is called Unicode.
As the table gets bigger, 1 byte (8 bits) is no longer enough to identify every entry. Unicode code points go up to U+10FFFF, which takes 21 bits, and UTF-8 ends up needing as many as 4 bytes to store one character.
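To check those sizes yourself, here is a small sketch. It assumes Python 3, where `sys.maxunicode` is the highest code point the interpreter accepts; the variable names are just mine.

```python
import sys

# Highest code point Python 3 accepts (U+10FFFF).
max_code_point = sys.maxunicode

# How many bits it takes to write that number in binary.
bits_needed = max_code_point.bit_length()

# How many bytes UTF-8 uses to store that single character.
utf8_length = len(chr(max_code_point).encode("utf-8"))

print(hex(max_code_point), bits_needed, utf8_length)  # 0x10ffff 21 4
```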
Think about this for a second: when a computer reads four bytes, how does it know whether they represent 1, 2, 3 or 4 characters?
If more than 1 byte is used to represent a character, then all of its bytes need to be packed together (think of a box) as one unit, marked so the reader can tell where the unit starts and how long it is. This "boxing" method is called UTF-8.
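To make the "four bytes, how many characters?" question concrete, here is a quick sketch. The sample strings are my own picks (written as escapes so they survive copy-paste); each one happens to encode to exactly four bytes in UTF-8, yet they hold different numbers of characters.

```python
samples = [
    "abcd",          # four 1-byte characters
    "\u00e9\u00e9",  # two 2-byte characters (é é)
    "\u6c49a",       # one 3-byte character (汉) plus one 1-byte character
    "\U0001F600",    # one 4-byte character (an emoji)
]

byte_counts = []
char_counts = []
for text in samples:
    data = text.encode("utf-8")
    byte_counts.append(len(data))
    char_counts.append(len(text))
    print(data.hex(), "->", len(data), "bytes,", len(text), "character(s)")
```

Same number of bytes every time, different number of characters. The decoder can only tell them apart because each "box" carries its own length marker.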
Number of bytes | Bits for code point (empty spaces) | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|
1 | 7 | 0xxxxxxx | | | |
2 | 11 | 110xxxxx | 10xxxxxx | | |
3 | 16 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
4 | 21 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
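As you can see above, each x is a bit you can use for storing the character's code point. Think of the fixed 0s and 1s as headers: the header of the first byte tells you how many bytes the character uses, and every following byte starts with 10.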
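The table translates into code almost line by line. What follows is a hand-rolled sketch for illustration, not how Python implements it internally. `utf8_encode` is a name I made up, and it deliberately skips details such as rejecting surrogate code points (U+D800 to U+DFFF), which the real encoder refuses.

```python
def utf8_encode(code_point: int) -> bytes:
    """Pack one code point into UTF-8 bytes by following the table."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:       # fits in 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # fits in 11 bits: 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point < 0x10000:    # fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0x10FFFF:  # fits in 21 bits: 11110xxx + three 10xxxxxx
        return bytes([
            0b11110000 | (code_point >> 18),
            0b10000000 | ((code_point >> 12) & 0b111111),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    raise ValueError("code point is outside the Unicode range")


# Compare with the built-in encoder for one character of each length.
for ch in ["A", "\u00e9", "\u6c49", "\U0001F600"]:
    print(hex(ord(ch)), utf8_encode(ord(ch)) == ch.encode("utf-8"))
```

Each branch ORs the header bits onto the front of a byte and fills the x positions with six (or fewer) bits of the code point, taken from the most significant end first.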
A Chinese character: 汉
hex(ord('汉'))
'0x6c49'
ord('汉')
27721
f'{ord("汉"):016b}'
'0110110001001001'
16 bits are needed to hold this character's code point. According to the UTF-8 format table above, that calls for the 3-byte form (16 empty spaces).
len(f'{ord("汉"):016b}')
16
f'{int("汉".encode("utf-8").hex(), 16):b}'
'111001101011000110001001'
Byte 1 | Byte 2 | Byte 3 |
---|---|---|
11100110 | 10110001 | 10001001 |
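Going the other way, you can strip the headers off those three bytes and glue the remaining bits back together to recover the code point. This is a minimal sketch for the 3-byte case only; the masks come straight from the format table and the variable names are mine.

```python
data = "\u6c49".encode("utf-8")  # the three bytes of 汉: e6 b1 89
b1, b2, b3 = data

# Confirm the headers: 1110 on the first byte, 10 on the other two.
assert b1 >> 4 == 0b1110
assert b2 >> 6 == 0b10 and b3 >> 6 == 0b10

# Keep only the x bits (4 + 6 + 6 = 16) and join them.
code_point = ((b1 & 0b00001111) << 12) | ((b2 & 0b00111111) << 6) | (b3 & 0b00111111)

print(f"{code_point:016b}")  # 0110110001001001
print(hex(code_point))       # 0x6c49
```

The 16 bits that come out are the same ones shown earlier for `ord('汉')`.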