#!/usr/bin/env python
# coding: utf-8

# This topic can be confusing because Unicode, UTF-8, hexadecimal, and binary are often mixed together. To clarify it, I am going to start with this:
#
# 1. If `English is the only language on the planet`, then we don't need the concepts of Unicode and UTF-8. ASCII would be enough.
#
#
# 2. However, since that is not the case, we need to go way beyond the 128 characters of the ASCII table (or even the 256 values a single byte can hold) to cover everything. This bigger table that holds almost everything is called Unicode.
#
#
# 3. As the table gets bigger, 1 byte (8 bits) is not enough to identify every entry. Unicode code points go up to U+10FFFF, which takes 21 bits, so UTF-8 uses up to 4 bytes per character.
#
#
# 4. Think about this for a second: when a computer reads four bytes, how does it know whether they represent 1, 2, 3, or 4 characters?
#
#
# 5. If more than 1 byte is used to represent a character, then all of its bytes need to be `packed` (think of a box) as one unit. This "boxing" method is called UTF-8.

# ### UTF-8 Format

# | Number of bytes | Bits for code point (empty spaces) | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
# |------|------|---|---|---|---|
# | 1 | 7 | 0xxxxxxx | | | |
# | 2 | 11 | 110xxxxx | 10xxxxxx | | |
# | 3 | 16 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
# | 4 | 21 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

# As you can see above, each `x` is a bit you can use for storing the character's code point. Think of the fixed 0s and 1s as headers: the header of the first byte tells the reader how many bytes the character occupies, and a leading `10` marks a continuation byte. That is how question 4 above gets answered.

# ### Example Time

# A Chinese character: 汉

# 1. Find the Unicode value (code point) of this character in hexadecimal format

# In[45]:


hex(ord('汉'))  # '0x6c49'


# 2. Humans do not think in hexadecimal, so we want to see the Unicode value in decimal

# In[8]:


ord('汉')  # 27721


# 3. However, computers can only store this character in binary:

# In[18]:


f'{ord("汉"):016b}'  # '0110110001001001'


# The code point fits in 16 bits (15 significant bits plus one leading zero of padding). That is more than the 11 bits a 2-byte sequence provides, so according to the UTF-8 format table above, 3 bytes (16 empty spaces) are needed.

# In[19]:


ord('汉').bit_length()  # 15 significant bits, so the 3-byte form is required


# 4. So let's pack (encode) this character using the UTF-8 format

# In[83]:


f'{int("汉".encode("utf-8").hex(), 16):b}'  # '111001101011000110001001'


# | Byte 1 | Byte 2 | Byte 3 |
# |---|---|---|
# | 11100110 | 10110001 | 10001001 |

# 5. Done! We can write the above bits onto a hard drive now (notice how you can save a document in UTF-8 format in almost all text editors). When a computer reads these bytes back, either you tell the text editor to read them as UTF-8, or it does so automatically (default preference).
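
# To make the packing step concrete, here is a minimal sketch that fills in the `x` slots of the UTF-8 format table by hand and then reads a header back. The helper names `utf8_encode_char` and `sequence_length` are made up for this illustration and are not part of Python; in real code you would simply call `str.encode("utf-8")`. The sketch also assumes a single valid character and skips the checks a real encoder performs (for example, rejecting surrogate code points).

```python
def utf8_encode_char(ch: str) -> bytes:
    """Encode one character by hand, following the UTF-8 format table."""
    cp = ord(ch)  # the Unicode code point as an integer
    if cp < 0x80:  # fits in 7 bits -> 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:  # fits in 11 bits -> 2 bytes: 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (cp >> 6),
            0b10000000 | (cp & 0b00111111),
        ])
    if cp < 0x10000:  # fits in 16 bits -> 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (cp >> 12),
            0b10000000 | ((cp >> 6) & 0b00111111),
            0b10000000 | (cp & 0b00111111),
        ])
    # up to 21 bits -> 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11110000 | (cp >> 18),
        0b10000000 | ((cp >> 12) & 0b00111111),
        0b10000000 | ((cp >> 6) & 0b00111111),
        0b10000000 | (cp & 0b00111111),
    ])


def sequence_length(first_byte: int) -> int:
    """Read the header of a leading byte: how many bytes does this character span?"""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    # 10xxxxxx is a continuation byte, so it cannot start a character
    raise ValueError("not a valid leading byte")


packed = utf8_encode_char("汉")
print(packed.hex())                 # e6b189
print(sequence_length(packed[0]))   # 3
print(packed == "汉".encode("utf-8"))  # True
```

# The last line compares the hand-packed bytes with Python's built-in encoder, which is a quick way to check the table against a real implementation.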