import unicodedata
Python has two types of "strings": byte strings and unicode strings.
In fact, byte strings are used to represent multiple different kinds of data:
These distinctions are not marked in the type.
In contrast, Unicode strings are always just strings.
Syntax to note is:
"..."
u"..."
\x10 - byte 16 (hexadecimal)
\10 - byte 8 (octal)
\t, \n, \r, etc. - special characters
\uxxxx - 16 bit unicode character
\Uxxxxxxxx - 32 bit unicode character
"\10"
'\\d10'
type("Hello, World")
str
type(u"Hallo, wie gähhhtsch?")
unicode
The functions ord and unichr convert between individual characters and their integer values.
unichr(77)
u'M'
ord(u"い")
12356
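For readers on a newer interpreter, the same round trip can be sketched in Python 3. This is an illustrative sketch, not part of the original session: it assumes Python 3, where unichr is spelled chr and every str is already a Unicode string. The variable names are made up for the example.

```python
# Python 3 sketch: chr() plays the role of Python 2's unichr().
codepoint = ord("\u3044")      # character -> integer (HIRAGANA LETTER I)
character = chr(77)            # integer -> character

# ord() and chr() are inverses of each other.
round_trip = chr(ord("\u3044"))

print(codepoint, character, round_trip)
```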
We refer to the number (integer) of a Unicode character as its codepoint.
Unicode is really primarily an assignment of codepoints to characters and their properties.
(The second important part of Unicode is encodings; we look at those below.)
The function unicode converts a byte string to a unicode string.
unicode("abc")
u'abc'
You can use str to convert from Unicode to a byte string, but it won't work for strings that can't be represented in ASCII. You really need a codec (coder-decoder); see below.
str(u"abc")
'abc'
str(u"äbc")
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-130-3d6f994274e9> in <module>()
----> 1 str(u"äbc")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
Note that there is a difference between displaying a value and printing it; the former uses the repr function, which tries to represent the string in an ASCII/source-friendly way, independent of encoding.
print repr(unichr(0x200))
unichr(0x200)
u'\u0200'
u'\u0200'
print unichr(0x200)
Ȁ
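The distinction between the escaped form and the printed form can also be sketched in Python 3, with one caveat that is an assumption about the reader's interpreter rather than something from the session above: in Python 3, repr keeps printable non-ASCII characters as they are, and the built-in ascii function produces the escaped, source-friendly form.

```python
# Python 3 sketch: ascii() gives the escaped form that Python 2's repr() showed.
ch = chr(0x200)

escaped = ascii(ch)   # quote, backslash, 'u0200', quote
shown = repr(ch)      # the character itself between quotes

print(escaped)
print(shown)
print(ch)
```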
Let's get an impression of what characters exist in Unicode.
for i in range(0x0100,0x4000,256):
    print "%4x"%i,"".join([unichr(j) for j in range(i,i+256,7)])
100 ĀćĎĕĜģĪıĸĿņōŔśŢũŰŷžƅƌƓƚơƨƯƶƽDŽNjǒǙǠǧǮǵǼ 200 ȀȇȎȕȜȣȪȱȸȿɆɍɔɛɢɩɰɷɾʅʌʓʚʡʨʯʶʽ˄ˋ˒˙ˠ˧ˮ˵˼ 300 ̸̜̣̪̱͍͔̀̇̎̿͆͛ͩ̕͢Ͱͷ;΅ΌΓΚΡΨίζντϋϒϙϠϧϮϵϼ 400 ЀЇЎЕМУЪбипцэєћѢѩѰѷѾ҅ҌғҚҡҨүҶҽӄӋӒәӠӧӮӵӼ 500 ԀԇԎԕԜԣԪԱԸԿՆՍՔ՛բթհշվօֶֽ֚֓֡֨֯ׄגינק 600 ؇؎ؕأترظؿنٍٔٛ٢٩ٰٷپڅڌړښڡڨگڶڽۄۋےۙ۠ۧۮ۵ۼ 700 ܀܇ܕܜܣܪܱܸ݆ܿݍݔݛݢݩݰݷݾޅތޓޚޡިޯ߄ߋߒߙߠߧ߮ߵ 800 ࠀࠇࠎࠕࠜࠣࠪ࠱࠸ࡆࡍࡔ࡛ࡢࡩࡰࡷࡾࢅࢌ࢚ࢡࢨࢯࢶࢽࣄ࣒࣮࣋ࣙ࣠ࣧࣵࣼ 900 ऀइऎकजणपऱसिॆ्॔ज़ॢ३॰ॷॾঅঌওচডনযশঽৄোৠ১৮৵ৼ a00 ਇਕਜਣਪਸਿ੍ਜ਼੩ੰઅઌઓચડનયશઽૄોૠ૧૮ૼ b00 ଇକଜଣପସି୍ୢ୩୰୷அஓசநயஶோ௧௮௵ c00 ఀఇఎకజణపఱసిె్ౢ౩౷౾ಅಌಓಚಡನಯಶಽೄೋೠ೧೮ d00 ഀഇഎകജണപറസിെ്ൔ൛ൢ൩൰൷ൾඅඌඓකඡඨදබලහිෙ෧෮ e00 งฎตผรสัุ฿ๆํ๔๛ຌຓບມຨຯຶຽໄ໋໒໙ f00 ༀ༇༎༕༜༣༪༱༸༿ཆཌྷཔཛརཀྵཷཾ྅ྌྒྷྚྡྨྯྶ࿄࿋࿒࿙ 1000 ကဇဎပလဣဪေးဿ၆၍ၔၛၢၩၰၷၾႅႌ႓ႚႡႨႯႶႽჄგკრყხჵჼ 1100 ᄀᄇᄎᄕᄜᄣᄪᄱᄸᄿᅆᅍᅔᅛᅢᅩᅰᅷᅾᆅᆌᆓᆚᆡᆨᆯᆶᆽᇄᇋᇒᇙᇠᇧᇮᇵᇼ 1200 ሀሇሎሕሜሣሪሱሸሿቆቍቔቛቢቩተቷቾኅኌናኚኡከኯኽዄዋዒዙዠዧዮድዼ 1300 ጀጇጎጕጜጣጪጱጸጿፆፍፔ።፩፰፷ᎅᎌ᎓ᎡᎨᎯᎶᎽᏄᏋᏒᏙᏠᏧᏮᏵᏼ 1400 ᐀ᐇᐎᐕᐜᐣᐪᐱᐸᐿᑆᑍᑔᑛᑢᑩᑰᑷᑾᒅᒌᒓᒚᒡᒨᒯᒶᒽᓄᓋᓒᓙᓠᓧᓮᓵᓼ 1500 ᔀᔇᔎᔕᔜᔣᔪᔱᔸᔿᕆᕍᕔᕛᕢᕩᕰᕷᕾᖅᖌᖓᖚᖡᖨᖯᖶᖽᗄᗋᗒᗙᗠᗧᗮᗵᗼ 1600 ᘀᘇᘎᘕᘜᘣᘪᘱᘸᘿᙆᙍᙔᙛᙢᙩᙰᙷᙾᚅᚌᚓᚚᚡᚨᚯᚶᚽᛄᛋᛒᛙᛠᛧᛮᛵ 1700 ᜀᜇᜎ᜕ᜣᜪᜱᝆᝍᝢᝩᝰចឌនរឡឨឯាួោ់្៙០៧៵ 1800 ᠀᠇᠕ᠣᠪᠱᠸᠿᡆᡍᡔᡛᡢᡩᡰᡷᢅᢌᢓᢚᢡᢨᢶᢽᣄᣋᣒᣙᣠᣧᣮᣵ 1900 ᤀᤇᤎᤕᤜᤣᤪᤱᤸ᥆᥍ᥔᥛᥢᥩᥰᦅᦌᦓᦚᦡᦨᦶᦽᧄ᧒᧙᧠᧧᧮᧵᧼ 1a00 ᨀᨇᨎᨕᨣᨪᨱᨸᨿᩆᩍᩔᩛᩢᩩᩰ᩷᪅᪓᪡᪨᪶᪽᫄᫋ 1b00 ᬀᬇᬎᬕᬜᬣᬪᬱᬸᬿᭆ᭔᭛᭢᭩᭰᭷᭾ᮅᮌᮓᮚᮡᮨᮯ᮶ᮽᯄᯋᯒᯙᯠᯧᯮ᯼ 1c00 ᰀᰇᰎᰕᰜᰣᰪᰱ᰿᱆ᱍ᱔ᱛᱢᱩᱰᱷ᱾ᲅᲓᲚᲡᲨᲯᲶᲽ᳄᳧᳙᳒᳠ᳮᳵ 1d00 ᴀᴇᴎᴕᴜᴣᴪᴱᴸᴿᵆᵍᵔᵛᵢᵩᵰᵷᵾᶅᶌᶓᶚᶡᶨᶯᶶᶽ᷄᷋᷒ᷙᷠᷧᷮ᷵᷼ 1e00 ḀḇḎḕḜḣḪḱḸḿṆṍṔṛṢṩṰṷṾẅẌẓẚạẨắẶẽỄịỒộỠủỮỵỼ 1f00 ἀἇἎἕἜἣἪἱἸἿὍὔὛὢὩὰίᾅᾌᾓᾚᾡᾨᾯᾶ᾽ῄΉῒῙῠῧ΅ῼ 2000 ―“‣‱‸‿⁆⁍⁔⁛⁰⁷⁾₅₌ₓₚ₡₨₯₶₽⃒⃙⃠⃮⃧ 2100 ℀ℇℎℕℜ℣Kℱℸℿⅆ⅍⅔⅛ⅢⅩⅰⅷⅾↅ↓↚↡↨↯↶↽⇄⇋⇒⇙⇠⇧⇮⇵⇼ 2200 ∀∇∎∕∜∣∪∱∸∿≆≍≔≛≢≩≰≷≾⊅⊌⊓⊚⊡⊨⊯⊶⊽⋄⋋⋒⋙⋠⋧⋮⋵⋼ 2300 ⌀⌇⌎⌕⌜⌣〉⌱⌸⌿⍆⍍⍔⍛⍢⍩⍰⍷⍾⎅⎌⎓⎚⎡⎨⎯⎶⎽⏄⏋⏒⏙⏠⏧⏮⏵⏼ 2400 ␀␇␎␕␜␣⑆③⑩⑰⑷⑾⒅⒌⒓⒚⒡⒨⒯ⒶⒽⓄⓋⓒⓙⓠⓧ⓮⓵⓼ 2500 ─┇┎┕├┣┪┱┸┿╆╍╔╛╢╩╰╷╾▅▌▓▚□▨▯▶▽◄○◒◙◠◧◮◵◼ 2600 ☀☇☎☕☜☣☪☱☸☿♆♍♔♛♢♩♰♷♾⚅⚌⚓⚚⚡⚨⚯⚶⚽⛄⛋⛒⛙⛠⛧⛮⛵⛼ 2700 ✀✇✎✕✜✣✪✱✸✿❆❍❔❛❢❩❰❷❾➅➌➓➚➡➨➯➶➽⟄⟋⟒⟙⟠⟧⟮⟵⟼ 2800 ⠀⠇⠎⠕⠜⠣⠪⠱⠸⠿⡆⡍⡔⡛⡢⡩⡰⡷⡾⢅⢌⢓⢚⢡⢨⢯⢶⢽⣄⣋⣒⣙⣠⣧⣮⣵⣼ 2900 ⤀⤇⤎⤕⤜⤣⤪⤱⤸⤿⥆⥍⥔⥛⥢⥩⥰⥷⥾⦅⦌⦓⦚⦡⦨⦯⦶⦽⧄⧋⧒⧙⧠⧧⧮⧵⧼ 2a00 ⨀⨇⨎⨕⨜⨣⨪⨱⨸⨿⩆⩍⩔⩛⩢⩩⩰⩷⩾⪅⪌⪓⪚⪡⪨⪯⪶⪽⫄⫋⫒⫙⫠⫧⫮⫵⫼ 2b00 ⬀⬇⬎⬕⬜⬣⬪⬱⬸⬿⭆⭍⭔⭛⭢⭩⭰⭷⭾⮅⮌⮓⮚⮡⮨⮯⮶⮽⯄⯋⯒⯙⯠⯧⯮⯵⯼ 2c00 ⰀⰇⰎⰕⰜⰣⰪⰱⰸⰿⱆⱍⱔⱛⱢⱩⱰⱷⱾⲅⲌⲓⲚⲡⲨⲯⲶⲽⳄⳋⳒⳙⳠ⳧ⳮ⳼ 2d00 ⴀⴇⴎⴕⴜⴣⴱⴸⴿⵆⵍⵔⵛⵢ⵰ⶅⶌⶓⶡⶨⶶⶽⷄⷋⷒⷙⷠⷧⷮⷵⷼ 2e00 ⸀⸇⸎⸕⸜⸣⸪⸱⸸⸿⹆⹍⹔⹛⺅⺌⺓⺡⺨⺯⺶⺽⻄⻋⻒⻙⻠⻧⻮ 2f00 ⼀⼇⼎⼕⼜⼣⼪⼱⼸⼿⽆⽍⽔⽛⽢⽩⽰⽷⽾⾅⾌⾓⾚⾡⾨⾯⾶⾽⿄⿋⿒⿵ 3000 〇『〕〜〣〪〱〸〿うきごせぢどばぷまゅれん゚ァエクザソツニヒベムョヮヵー 3100 ㄇㄎㄕㄜㄣㄪㄱㄸㄿㅆㅍㅔㅛㅢㅩㅰㅷㅾㆅㆌ㆓㆚ㆡㆨㆯㆶㆽ㇄㇋㇒㇙㇠ㇵㇼ 3200 ㈀㈇㈎㈕㈜㈣㈪㈱㈸㈿㉆㉍㉔㉛㉢㉩㉰㉷㉾㊅㊌㊓㊚㊡㊨㊯㊶㊽㋄㋋㋒㋙㋠㋧㋮㋵㋼ 3300 
㌀㌇㌎㌕㌜㌣㌪㌱㌸㌿㍆㍍㍔㍛㍢㍩㍰㍷㍾㎅㎌㎓㎚㎡㎨㎯㎶㎽㏄㏋㏒㏙㏠㏧㏮㏵㏼ 3400 㐀㐇㐎㐕㐜㐣㐪㐱㐸㐿㑆㑍㑔㑛㑢㑩㑰㑷㑾㒅㒌㒓㒚㒡㒨㒯㒶㒽㓄㓋㓒㓙㓠㓧㓮㓵㓼 3500 㔀㔇㔎㔕㔜㔣㔪㔱㔸㔿㕆㕍㕔㕛㕢㕩㕰㕷㕾㖅㖌㖓㖚㖡㖨㖯㖶㖽㗄㗋㗒㗙㗠㗧㗮㗵㗼 3600 㘀㘇㘎㘕㘜㘣㘪㘱㘸㘿㙆㙍㙔㙛㙢㙩㙰㙷㙾㚅㚌㚓㚚㚡㚨㚯㚶㚽㛄㛋㛒㛙㛠㛧㛮㛵㛼 3700 㜀㜇㜎㜕㜜㜣㜪㜱㜸㜿㝆㝍㝔㝛㝢㝩㝰㝷㝾㞅㞌㞓㞚㞡㞨㞯㞶㞽㟄㟋㟒㟙㟠㟧㟮㟵㟼 3800 㠀㠇㠎㠕㠜㠣㠪㠱㠸㠿㡆㡍㡔㡛㡢㡩㡰㡷㡾㢅㢌㢓㢚㢡㢨㢯㢶㢽㣄㣋㣒㣙㣠㣧㣮㣵㣼 3900 㤀㤇㤎㤕㤜㤣㤪㤱㤸㤿㥆㥍㥔㥛㥢㥩㥰㥷㥾㦅㦌㦓㦚㦡㦨㦯㦶㦽㧄㧋㧒㧙㧠㧧㧮㧵㧼 3a00 㨀㨇㨎㨕㨜㨣㨪㨱㨸㨿㩆㩍㩔㩛㩢㩩㩰㩷㩾㪅㪌㪓㪚㪡㪨㪯㪶㪽㫄㫋㫒㫙㫠㫧㫮㫵㫼 3b00 㬀㬇㬎㬕㬜㬣㬪㬱㬸㬿㭆㭍㭔㭛㭢㭩㭰㭷㭾㮅㮌㮓㮚㮡㮨㮯㮶㮽㯄㯋㯒㯙㯠㯧㯮㯵㯼 3c00 㰀㰇㰎㰕㰜㰣㰪㰱㰸㰿㱆㱍㱔㱛㱢㱩㱰㱷㱾㲅㲌㲓㲚㲡㲨㲯㲶㲽㳄㳋㳒㳙㳠㳧㳮㳵㳼 3d00 㴀㴇㴎㴕㴜㴣㴪㴱㴸㴿㵆㵍㵔㵛㵢㵩㵰㵷㵾㶅㶌㶓㶚㶡㶨㶯㶶㶽㷄㷋㷒㷙㷠㷧㷮㷵㷼 3e00 㸀㸇㸎㸕㸜㸣㸪㸱㸸㸿㹆㹍㹔㹛㹢㹩㹰㹷㹾㺅㺌㺓㺚㺡㺨㺯㺶㺽㻄㻋㻒㻙㻠㻧㻮㻵㻼 3f00 㼀㼇㼎㼕㼜㼣㼪㼱㼸㼿㽆㽍㽔㽛㽢㽩㽰㽷㽾㾅㾌㾓㾚㾡㾨㾯㾶㾽㿄㿋㿒㿙㿠㿧㿮㿵㿼
There are three kinds of "strings" you have to think about:
Important property:
UTF-8 strings, ASCII strings, and Unicode strings are all "the same" if all the characters are in the ASCII character set.
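A minimal sketch of this property, assuming Python 3 (where encode returns a bytes object and iterating over bytes yields integers); the variable names are illustrative only:

```python
# For pure-ASCII text, the ASCII and UTF-8 encodings produce identical bytes,
# and each byte value equals the codepoint of the corresponding character.
text = "Hello, World"

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")

same_bytes = (as_ascii == as_utf8)
codepoints_match = (list(as_utf8) == [ord(c) for c in text])

print(same_bytes, codepoints_match)
```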
In Python, encoding and decoding are performed via the encode and decode methods.
They take a codec name as an argument (ascii or utf-8 are the only ones that are relevant to us), plus an optional argument saying what should happen if a string is not de/encodable.
u"abc".encode("ascii")
'abc'
u"äbc".encode("ascii","replace")
'?bc'
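The optional error argument accepts several standard handler names besides "replace". The sketch below, written for Python 3 as an assumption about the reader's interpreter, runs the same string through a few of them; the dictionary name is made up for the example.

```python
# Python 3 sketch: what the standard error handlers do with a character
# that the ASCII codec cannot encode.
text = "\u00e4bc"   # a-umlaut followed by "bc"

results = {
    handler: text.encode("ascii", handler)
    for handler in ("replace", "ignore", "xmlcharrefreplace", "backslashreplace")
}

# The default handler is "strict", which raises instead of substituting.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

for handler, encoded in results.items():
    print(handler, encoded)
print("strict raised:", strict_failed)
```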
Let's look at a non-ASCII character. As you can see here, the German umlaut "ä" turns into a two-byte sequence when encoded in UTF-8. Each of the two bytes has the high bit set. You can look up the exact encoding scheme online.
u"äbc".encode("utf-8")
'\xc3\xa4bc'
Some more unusual characters are encoded as four byte sequences in UTF-8.
u"𝍢".encode("utf-8")
'\xf0\x9d\x8d\xa2'
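Note that although four bytes (i.e. 32 bits) are used for encoding this character, its codepoint is only 119650. That's because only some of the bits in each byte carry the codepoint: in a four-byte sequence the first byte contributes 3 bits and each of the remaining three bytes contributes 6, for 21 bits in total. The other bits mark the byte's position in the sequence.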
ord(u"𝍢")
119650
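To make the bit layout concrete, here is a hand-written sketch of the four-byte case, assuming Python 3. The function name encode_utf8_four_bytes is hypothetical and it deliberately handles only codepoints from U+10000 to U+10FFFF; real code should simply call encode("utf-8").

```python
def encode_utf8_four_bytes(codepoint):
    """Encode a codepoint in the range U+10000..U+10FFFF by hand (sketch only)."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only the four-byte range is handled in this sketch")
    return bytes([
        0xF0 | (codepoint >> 18),            # 11110xxx: top 3 bits
        0x80 | ((codepoint >> 12) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | ((codepoint >> 6) & 0x3F),    # 10xxxxxx: next 6 bits
        0x80 | (codepoint & 0x3F),           # 10xxxxxx: lowest 6 bits
    ])

manual = encode_utf8_four_bytes(119650)
builtin = chr(119650).encode("utf-8")

print(manual, builtin, manual == builtin)
```

Only 3 + 6 + 6 + 6 = 21 of the 32 bits hold the codepoint, which is why a four-byte sequence corresponds to a comparatively small integer.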
Of course, we can also decode.
print '\xc3\xa4bc'.decode("utf-8")
äbc
You get an error message if the decoding is not possible.
print '\xc3\xa4bc'.decode("ascii")
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-87-d1315b67a57b> in <module>()
----> 1 print '\xc3\xa4bc'.decode("ascii")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
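Decoding accepts the same optional error argument as encoding. A minimal Python 3 sketch (an assumption about the interpreter; byte strings are written with a b prefix there) of decoding the same bytes with the right codec, and with the wrong codec under two lenient handlers:

```python
# Python 3 sketch: decoding UTF-8 bytes with the right and the wrong codec.
data = b"\xc3\xa4bc"

decoded = data.decode("utf-8")              # correct codec: a-umlaut + "bc"
lenient = data.decode("ascii", "replace")   # each bad byte becomes U+FFFD
ignored = data.decode("ascii", "ignore")    # bad bytes are dropped

try:
    data.decode("ascii")                    # strict is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(decoded), repr(lenient), repr(ignored), strict_failed)
```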
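Standard file objects returned by open do not encode or decode UTF-8 for you; writing a unicode string that contains non-ASCII characters fails because the ASCII codec is used.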
with open("temp","w") as stream: stream.write(u"Käse und Brot")
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-73-3a350ad9897f> in <module>()
----> 1 with open("temp","w") as stream: stream.write(u"Käse und Brot")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)
To read and write Unicode, use the codecs.open function.
It returns a file-like object that behaves like a standard file, but it does the right kind of encoding/decoding for UTF-8.
import codecs
with codecs.open("temp","w","utf-8") as stream: stream.write(u"Käse und Brot")
with codecs.open("temp","r","utf-8") as stream: print stream.read()
Käse und Brot
!cat -v temp
KM-CM-$se und Brot
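The cat -v output shows that the file really contains the two UTF-8 bytes for "ä". The same check can be sketched without the shell by reading the file back in binary mode. This sketch assumes Python 3 and uses io.open with an encoding argument in place of codecs.open; it writes to a temporary file (a choice made for the example) rather than to "temp".

```python
import io
import os
import tempfile

text = "K\u00e4se und Brot"

# Create a scratch file so the sketch does not clobber anything.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Text mode with an explicit encoding: encoding happens on write.
    with io.open(path, "w", encoding="utf-8") as stream:
        stream.write(text)
    # Text mode again: decoding happens on read.
    with io.open(path, "r", encoding="utf-8") as stream:
        read_back = stream.read()
    # Binary mode: no decoding, we see the raw UTF-8 bytes.
    with io.open(path, "rb") as stream:
        raw = stream.read()
finally:
    os.remove(path)

print(read_back == text, raw)
```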
The Unicode consortium defines a lot of information associated with each codepoint.
In Python, you can query this information using the unicodedata module.
import unicodedata
Information about each codepoint includes:
unicodedata.name(u"ä")
'LATIN SMALL LETTER A WITH DIAERESIS'
unicodedata.category(u"ä")
'Ll'
ord(u"ä")
228
unicodedata.decimal(u"3")
3
unicodedata.numeric(u"四")
4.0
unicodedata.category(u"ß")
'Ll'
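The individual queries above can be collected into one small helper. This is a sketch assuming Python 3; the function name describe and the dictionary keys are made up for the example. It passes a default value to unicodedata.name and unicodedata.numeric, since both raise ValueError when a character has no such property.

```python
import unicodedata

def describe(ch):
    """Collect a few Unicode properties of a single character (sketch)."""
    return {
        "codepoint": ord(ch),
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        # None when the character has no numeric value.
        "numeric": unicodedata.numeric(ch, None),
    }

info = describe("\u00e4")   # a with diaeresis
four = describe("\u56db")   # CJK ideograph for "four"

print(info)
print(four)
```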
Various letters are really just combined forms of separate parts.
For example, the letter "ä" can be viewed as a combination of the letter "a" with the diacritic " ̈".
The Unicode consortium hasn't been consistent about how to represent these, so the same letter as it appears on the screen can be represented in two different ways.
print u'\u00e4'
print u'a\u0308'
ä ä
Although these strings look the same, they are represented differently.
u'\u00e4'==u'a\u0308'
False
The unicodedata module can decompose characters.
unicodedata.decomposition(u"ä")
'0061 0308'
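The decomposition is returned as a string of hexadecimal codepoints, not as characters. A minimal sketch of turning it back into a string, assuming Python 3; the helper name decompose is hypothetical. Compatibility decompositions carry a leading tag such as <compat>, which the sketch skips.

```python
import unicodedata

def decompose(ch):
    """Return the one-level decomposition of ch as a string (sketch).

    Returns an empty string when ch has no decomposition.
    """
    fields = unicodedata.decomposition(ch).split()
    # Skip formatting tags like "<compat>"; the rest are hex codepoints.
    return "".join(chr(int(f, 16)) for f in fields if not f.startswith("<"))

parts = decompose("\u00e4")      # "a" followed by COMBINING DIAERESIS
ligature = decompose("\ufb03")   # the ffi ligature
plain = decompose("a")           # no decomposition

print(repr(parts), repr(ligature), repr(plain))
```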
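More generally, it can normalize a string into one of four forms: NFC, NFD, NFKC, and NFKD.
What does this mean?
The D forms decompose characters into base letters and combining marks, and the C forms recompose them afterwards. NFC and NFD only treat canonically equivalent sequences as the same, while NFKC and NFKD also fold compatibility equivalents.
For example, the ligature "ﬀ" (U+FB00) is compatible with the two letters "ff", but not canonically equivalent (since they look different).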
Yes, unfortunately, you do need to worry about this.
for n in ["NFC","NFKC","NFD","NFKD"]:
    s = unicodedata.normalize(n,u"ä")
    print n,repr(s),s
NFC u'\xe4' ä
NFKC u'\xe4' ä
NFD u'a\u0308' ä
NFKD u'a\u0308' ä
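In practice this means normalizing both sides before comparing strings. A minimal sketch, assuming Python 3; the helper name same_text is made up, and the default of NFC is an arbitrary choice for the example (any one form works as long as both sides use it).

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form (sketch)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "\u00e4"      # single codepoint
decomposed = "a\u0308"   # base letter + combining diaeresis

raw_equal = (composed == decomposed)
normalized_equal = same_text(composed, decomposed)

# Normalization also changes the length of the string.
nfc_len = len(unicodedata.normalize("NFC", decomposed))
nfd_len = len(unicodedata.normalize("NFD", composed))

print(raw_equal, normalized_equal, nfc_len, nfd_len)
```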
print u"r\u0308"
r̈
print u"+\u0308"
+̈
print u"\u0308"
̈
print u" \u0308"
̈
Many languages have ligatures. In some languages and scripts (e.g., German), ligatures like "ä" and "ß" have become letters in their own right. In other scripts, ligatures are just different presentations depending on the context a character appears in; in those cases, Unicode does not represent ligatures as separate code points.
Here is an example in Arabic. Note how the string looks very different when printed a character at a time vs. when printed as a word (also note that Arabic is a right-to-left language):
s = u"كتاب"
print s
كتاب
for c in s: print c,
print
ك ت ا ب
In contrast, the "ffi" ligature in English has its own Unicode codepoint.
s = u"a\ufb03ne"
print s
aﬃne
for c in s: print c,
print
a ﬃ n e
s==u"affine"
False
unicodedata.normalize("NFKD",s)
u'affine'
unicodedata.normalize("NFKD",s)==unicodedata.normalize("NFKD",u"affine")
True
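The ligature example also shows the difference between canonical and compatibility normalization: the canonical forms leave the ligature alone, and only the K forms expand it. A short sketch, assuming Python 3, with variable names chosen for the example:

```python
import unicodedata

s = "a\ufb03ne"   # "a" + LATIN SMALL LIGATURE FFI + "ne"

canonical = unicodedata.normalize("NFD", s)       # ligature is kept
compatible = unicodedata.normalize("NFKD", s)     # ligature becomes "ffi"

print(repr(canonical), repr(compatible))
print(len(s), len(compatible))
```

Because compatibility normalization discards distinctions like this one, it suits searching and comparison better than storing text you need to reproduce exactly.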