This topic can be confusing because the concepts of Unicode, UTF-8, hexadecimal and binary are all mixed together. To clarify it, I am going to start with this:

  1. If English were the only language on the planet, we would not need the concepts of Unicode and UTF-8. ASCII would be enough.
  1. However, since that is not the case, we need to go way beyond the 128 symbols of the ASCII table (256 if you count its 8-bit extensions) to hold everything. This bigger table that holds almost everything is called Unicode.
  1. As the table gets bigger, 1 byte (8 bits) is not enough to number every entry. It turns out that up to 4 bytes are needed to do so.
  1. Think about this for a second: when a computer reads four bytes, how does it know whether they represent 1, 2, 3 or 4 characters?
  1. If more than 1 byte is used to represent a character, then all of its bytes need to be packed (think of a box) as one unit. This "boxing" method is called UTF-8.
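To make the "four bytes" question concrete, here is a small sketch using Python's built-in `str.encode` and `bytes.decode`. The sample characters and the variable names are my own choices for illustration.

```python
# Characters that need 1, 2, 3 and 4 bytes in UTF-8.
samples = ["A", "\u00e9", "汉", "\U0001f600"]  # A, e-acute, 汉, grinning-face emoji
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
print(byte_counts)

# The same number of bytes can hold a different number of characters.
four_ascii = b"abcd".decode("utf-8")              # four 1-byte characters
one_emoji = "\U0001f600".encode("utf-8").decode("utf-8")  # one 4-byte character
print(len(b"abcd"), len(four_ascii))
print(len("\U0001f600".encode("utf-8")), len(one_emoji))
```

Both byte strings are four bytes long, yet one decodes to four characters and the other to a single character. The decoder can only tell them apart because of the header bits that UTF-8 puts in front of the payload.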

UTF-8 Format

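| Number of bytes | Bits for code point (empty spaces) | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
| --- | --- | --- | --- | --- | --- |
| 1 | 7 | 0xxxxxxx | | | |
| 2 | 11 | 110xxxxx | 10xxxxxx | | |
| 3 | 16 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 4 | 21 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |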

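As you can see above, each x represents one bit you can use for storing a character's code point. Think of the fixed 0s and 1s as headers: the first byte's header says how many bytes are in the box, and every following byte starts with 10.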
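The header idea can be sketched in a few lines of Python. The function name `sequence_length` is hypothetical, and it only inspects the header pattern of the first byte; it is not a full UTF-8 validator.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes the 'box' holds, judging by the first byte's header."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte, so it cannot start a character.
    raise ValueError(f"{first_byte:08b} is not a valid leading byte")


first = "汉".encode("utf-8")[0]
print(f"{first:08b}", sequence_length(first))
```

This is how a reader answers the "how many characters?" question: look at the first byte, take that many bytes as one unit, then repeat.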

Example Time

A Chinese character: 汉

  1. Find the Unicode value (code point) of this character in hexadecimal format
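A minimal sketch of this step, assuming Python's built-in `ord` and `hex` (the variable names are mine):

```python
han = "汉"
hex_value = hex(ord(han))  # ord gives the code point, hex formats it
print(hex_value)  # 0x6c49
```

This is the same value usually written as U+6C49.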
  1. Humans do not think in hexadecimal, so we want to see the Unicode value in decimal
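One way to get the decimal value, again assuming the built-in `ord`; converting the hexadecimal digits with `int(..., 16)` gives the same number:

```python
decimal_value = ord("汉")
print(decimal_value)  # 27721

# The hexadecimal form converts to the same number.
from_hex = int("6c49", 16)
print(from_hex == decimal_value)  # True
```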
  1. However, computers can only store this character in binary:
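A sketch of the binary view, assuming Python's `b` format specifier (the built-in `bin` would work as well, with a `0b` prefix):

```python
binary_value = f"{ord('汉'):b}"
print(binary_value)  # 110110001001001
```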

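15 bits are needed to hold this character's code point. That is more than the 11 empty spaces a 2-byte box offers, so according to the UTF-8 format table above, 3 bytes (16 empty spaces) are needed, with one leading 0 as padding.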
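The lookup in the format table can be sketched as follows. The helper name `utf8_bytes_needed` is hypothetical; the capacities 7, 11, 16 and 21 come from the table above. Note that the code point itself has 15 significant bits, which are left-padded with a zero to fill the 16 slots of a 3-byte box.

```python
def utf8_bytes_needed(code_point: int) -> int:
    """Pick the smallest box whose empty spaces can hold the code point."""
    capacities = [(1, 7), (2, 11), (3, 16), (4, 21)]  # (bytes, payload bits)
    bits = code_point.bit_length()
    for n_bytes, payload_bits in capacities:
        if bits <= payload_bits:
            return n_bytes
    raise ValueError("code point does not fit in 21 bits")


significant_bits = ord("汉").bit_length()
print(significant_bits)                # 15
print(utf8_bytes_needed(ord("汉")))    # 3
```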

  1. So let's pack (encode) this character using the UTF-8 format:
In [83]:
f'{int("汉".encode("utf-8").hex(), 16):b}'
| Byte 1 | Byte 2 | Byte 3 |
| --- | --- | --- |
| 11100110 | 10110001 | 10001001 |
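To see where those three bytes come from, here is a sketch that fills the 3-byte template by hand and compares the result with Python's own encoder. It only handles the 3-byte case, and the variable names are mine.

```python
code_point = ord("汉")
payload = f"{code_point:016b}"  # 16 empty spaces: '0110110001001001'

# Template: 1110xxxx 10xxxxxx 10xxxxxx
packed = (
    "1110" + payload[:4]      # header + first 4 payload bits
    + "10" + payload[4:10]    # header + next 6 payload bits
    + "10" + payload[10:]     # header + last 6 payload bits
)
manual = int(packed, 2).to_bytes(3, "big")

print(packed)                          # 111001101011000110001001
print(manual == "汉".encode("utf-8"))  # True
```

The hand-packed bits match what `"汉".encode("utf-8")` produces, namely the bytes E6 B1 89.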
  1. Done! We can write the above bits onto a hard drive now (notice how you can save a document in UTF-8 format in almost all text editors). When a computer reads these bytes back, either you tell the text editor to read them as UTF-8 or it does so automatically (the default preference).
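The save-and-read-back step can be sketched with Python's built-in `open` and its `encoding` parameter. The file name `han.txt` and the temporary directory are assumptions made only so the example cleans up after itself.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "han.txt")  # hypothetical file name

    # Save the character as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write("汉")

    # The raw bytes on disk are the three packed bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading with the right encoding gives the character back.
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

    # Reading with the wrong encoding gives three unrelated characters.
    with open(path, encoding="latin-1") as f:
        wrong = f.read()

print(raw.hex())        # e6b189
print(round_trip)       # 汉
print(len(wrong))       # 3
```

This is why a text editor has to know (or guess) the encoding: the same three bytes are one character under UTF-8 and three characters under Latin-1.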