#!/usr/bin/env python
# coding: utf-8
# This topic can be confusing because Unicode, UTF-8, hexadecimal, and binary are often mixed together. To untangle them, I am going to start with this:
#
# 1. If `English is the only language on the planet`, then we don't need the concept of Unicode and UTF-8. ASCII would be enough.
#
#
# 2. However, since that is not the case, we need to go way beyond the 128 symbols of the ASCII table (256 even with its 8-bit extensions) to hold everything. This bigger table, which assigns a number (a code point) to almost every character, is called Unicode.
#
#
# 3. As the table gets bigger, 1 byte (8 bits) is not enough to number every entry. Unicode code points go up to U+10FFFF, which takes 21 bits, so up to 4 bytes are needed per character.
#
#
# 4. Think about this for a second: when a computer reads four bytes, how does it know whether they represent 1, 2, 3 or 4 characters?
#
#
# 5. If more than 1 byte is used to represent a character, then all of its bytes need to be `packed` (think of a box) as one unit, with a marker showing where each box starts. The most common of these "boxing" methods is called UTF-8.
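# To make the points above concrete, here is a small sketch that counts how many bytes a few characters need in UTF-8. The sample characters are my own picks for illustration (written as escapes so the code points are unambiguous); any characters from the same ranges should behave the same way.
```python
# Sample characters: ASCII letter, accented Latin letter, CJK character, emoji.
samples = ["A", "\u00e9", "\u6c49", "\U0001f600"]

# str.encode("utf-8") returns the packed bytes; len() counts them.
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_counts.items():
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s)")
```
# The further a character sits from the ASCII range, the more bytes its box needs.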
# ### UTF-8 Format
# | Number of bytes | Bits for code point (empty spaces) | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
# |------|------|---|---|---|---|
# | 1 | 7 | 0xxxxxxx | | | |
# | 2 | 11 | 110xxxxx | 10xxxxxx | | |
# | 3 | 16 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
# | 4 | 21 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
#
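# As you can see above, each `x` is a free bit you can use for storing a character's code point. Think of the fixed 0s and 1s as headers: the prefix of the first byte tells how many bytes the character uses, and every following byte starts with `10`.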
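# The table can be turned into code almost line by line. The function below is a hypothetical helper written for illustration (it is not part of Python's standard library, which already does this via `str.encode`), and it is a simplified sketch: it does not reject surrogate code points (U+D800 to U+DFFF) the way a strict encoder would.
```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Hypothetical helper: pack one code point into UTF-8 by hand."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:        # fits in 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # fits in 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp < 0x10000:     # fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0x10FFFF:   # fits in 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (cp >> 18),
                      0b10000000 | ((cp >> 12) & 0b111111),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    raise ValueError("code point is outside the Unicode range")

# Compare the hand-rolled result with Python's built-in encoder.
for ch in ["A", "\u00e9", "\u6c49", "\U0001f600"]:
    print(ch, utf8_encode_code_point(ord(ch)) == ch.encode("utf-8"))
```
# Each branch masks off 6 bits at a time for the continuation bytes and puts whatever is left into the first byte, next to its header.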
# ### Example Time
# A Chinese character: 汉
# 1. Find the Unicode value (code point) of this character in hexadecimal format
# In[45]:
hex(ord('汉'))
# 2. Humans do not think in hexadecimal, so we want to see the Unicode value in decimal
# In[8]:
ord('汉')
# 3. However, computers can only store this character in binary:
# In[18]:
f'{ord("汉"):016b}'
# The code point fits in 16 bits (15 significant bits, zero-padded to 16 here), which is too many for the 11 free bits of the 2-byte form. According to the UTF-8 format table above, 3 bytes (16 empty spaces) are needed.
# In[19]:
len(f'{ord("汉"):016b}')
# 4. So let's pack (encode) this character using the UTF-8 format
# In[83]:
f'{int("汉".encode("utf-8").hex(), 16):b}'
# |Byte 1| Byte 2| Byte 3|
# |---|---|---|
# |11100110|10110001|10001001|
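# To see where those three bytes come from, the packing can be redone by hand on the bit string: split the 16 bits into groups of 4, 6 and 6, then put the 3-byte headers in front. This is only an illustrative sketch; the variable names are my own.
```python
# The 16-bit code point of the character, as a string of 0s and 1s.
bits = f'{ord("汉"):016b}'

# Split into the 4 + 6 + 6 free slots of the 3-byte form.
chunks = [bits[:4], bits[4:10], bits[10:]]

# Prepend the headers: 1110 for the first byte, 10 for each following byte.
packed = ["1110" + chunks[0], "10" + chunks[1], "10" + chunks[2]]
print(packed)

# Turn the bit strings back into real bytes and compare with the built-in encoder.
print(bytes(int(b, 2) for b in packed) == "汉".encode("utf-8"))
```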
# 5. Done! We can now write the above bits to a hard drive (notice how almost all text editors let you save a document in UTF-8 format). When a computer reads these bytes back, either you tell the text editor to read them as UTF-8 or it does so automatically (the default preference).
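# Reading works the same way in reverse. The sketch below decodes the three bytes from the example once as UTF-8 and once with a different codec (Latin-1 is my own choice here, picked only because it maps every byte to exactly one character) to show why the reader has to know the encoding.
```python
# The three bytes produced in step 4.
raw = bytes([0b11100110, 0b10110001, 0b10001001])

# Decoded as UTF-8, the headers are recognized and the bytes form one character.
as_utf8 = raw.decode("utf-8")

# Decoded as Latin-1, each byte is treated as its own character.
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), len(as_latin1))
```
# The same bytes give one character with the right codec and three unrelated characters with the wrong one.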