This topic can be confusing because Unicode, UTF-8, hexadecimal and binary all get mixed together. To untangle them, I am going to start with this:

If English were the only language on the planet, we would not need Unicode or UTF-8. ASCII would be enough.
However, since that is not the case, we need to go way beyond the 128 symbols of the ASCII table (or even the 256 values that fit in one byte) to hold everything. This bigger table that holds almost everything is called Unicode.
As the table gets bigger, 1 byte (8 bits) is not enough to identify every entry. Unicode code points go up to 0x10FFFF, which takes 21 bits, so up to 4 bytes are needed for a single character.
Think about this for a second: when a computer reads four bytes, how does it know whether they represent 1, 2, 3 or 4 characters?
If more than 1 byte is used to represent a character, then all of its bytes need to be packed together (think of a box) as one unit. This "boxing" method is called UTF-8.
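
To make the "box" idea concrete, here is a small sketch that asks Python how many bytes UTF-8 uses for a few characters. The sample characters are my own picks for illustration; each one happens to need a different box size.

```python
# Sample characters chosen for illustration (not from the text above).
samples = ["A", "\u00e9", "汉", "\U0001F600"]  # A, e with acute accent, 汉, an emoji

# Map each character to the number of bytes its UTF-8 encoding occupies.
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_counts.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")
```

The same string type holds all four characters; only the encoded byte length differs.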

UTF-8 Format

Number of bytes	Bits for code point (empty spaces)	Byte 1	Byte 2	Byte 3	Byte 4
1	7	0xxxxxxx
2	11	110xxxxx	10xxxxxx
3	16	1110xxxx	10xxxxxx	10xxxxxx
4	21	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

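As you can see above, each x marks a bit you can use for storing the character's code point. Think of the fixed 0s and 1s as headers: the prefix of the first byte says how many bytes are in the box, and every following byte starts with 10.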
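
The table can be turned into code fairly directly. The function below is a hand-rolled sketch, not how Python implements it internally; `utf8_encode` is a hypothetical helper name, and for brevity it does not reject the surrogate range (0xD800 to 0xDFFF) the way a strict encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Hypothetical helper: encode one code point by following the table by hand."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point < 0x80:            # fits in 7 bits  -> 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:           # fits in 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:         # fits in 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # fits in 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


# Compare the hand-rolled result with Python's built-in encoder.
print(utf8_encode(ord("A")) == "A".encode("utf-8"))
```

Each branch fills the x slots from the most significant bits down, six bits at a time for the continuation bytes.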

Example Time

A Chinese character: 汉

Find the Unicode value (code point) of this character in hexadecimal format

In [45]:

hex(ord('汉'))

Out[45]:

'0x6c49'

Humans do not usually think in hexadecimal, so let's also see the Unicode value in decimal

In [8]:

ord('汉') 

Out[8]:

27721

However, computers can only store this character in binary:

In [18]:

f'{ord("汉"):016b}'

Out[18]:

'0110110001001001'

The code point fits in 16 bits (15 significant bits plus one leading zero), which is more than the 11 bits a 2-byte sequence offers. According to the UTF-8 format table above, 3 bytes (16 empty spaces) are needed.

In [19]:

len(f'{ord("汉"):016b}')

Out[19]:

16

So let's pack (encode) this character using the UTF-8 format:

In [83]:

f'{int("汉".encode("utf-8").hex(), 16):b}'

Out[83]:

'111001101011000110001001'

Byte 1	Byte 2	Byte 3
11100110	10110001	10001001
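
To see where those three bytes come from, the sketch below does the packing by hand: it splits the 16 code point bits into groups of 4, 6 and 6 (the x slots of the 3-byte row) and puts the header bits in front of each group. The variable names are mine, chosen for illustration.

```python
bits = f"{ord('汉'):016b}"  # the 16-bit binary string of the code point

# Split the payload into groups of 4, 6 and 6 bits (the x slots of the 3-byte row).
chunks = [bits[:4], bits[4:10], bits[10:]]

# Prepend the fixed header bits from the table to each group.
headers = ["1110", "10", "10"]
packed = [h + c for h, c in zip(headers, chunks)]

# Turn the three 8-bit strings into real bytes and compare with the built-in encoder.
encoded = bytes(int(b, 2) for b in packed)
print(packed)
print(encoded == "汉".encode("utf-8"))
```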

Done! We can now write these bits to a hard drive (notice how almost every text editor lets you save a document in UTF-8 format). When a computer reads the bytes back, either you tell the text editor to decode them as UTF-8 or it does so automatically (UTF-8 is often the default).
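
Reading the bytes back is the reverse step. As a rough illustration of why the chosen encoding matters, the sketch below decodes the same three bytes twice: once as UTF-8 and once as Latin-1, a single-byte encoding I picked only for contrast.

```python
# The three bytes we packed above.
raw = bytes([0b11100110, 0b10110001, 0b10001001])

# Decoded as UTF-8, the three bytes form one box, so we get one character back.
as_utf8 = raw.decode("utf-8")

# Decoded as Latin-1, every byte is treated as its own character.
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), len(as_latin1))
print(as_utf8 == "汉")
```

The bytes on disk are identical in both cases; only the interpretation changes, which is why a file opened with the wrong encoding shows up as garbage.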