PyThaiNLP Get Started¶

Code examples for basic functions in PyThaiNLP https://github.com/PyThaiNLP/pythainlp

In [1]:

# # pip install required modules
# # uncomment if running from colab
# # see list of modules in `requirements` and `extras`
# # in https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py

#!pip install pythainlp
#!pip install epitran

Import PyThaiNLP¶

In [2]:

import pythainlp

pythainlp.__version__

Out[2]:

'2.2.1'

Thai Characters¶

PyThaiNLP provides ready-to-use Thai character sets (e.g. Thai consonants, vowels, tone marks, symbols) as strings for convenience. There are also a few utility functions to test whether a string is in Thai.

In [3]:

pythainlp.thai_characters

Out[3]:

'กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮฤฦะัาำิีึืุูเแโใไๅํ็่้๊๋ฯฺๆ์ํ๎๏๚๛๐๑๒๓๔๕๖๗๘๙฿'

In [4]:

len(pythainlp.thai_characters)

Out[4]:

In [5]:

pythainlp.thai_consonants

Out[5]:

'กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ'

In [6]:

len(pythainlp.thai_consonants)

Out[6]:

In [7]:

"๔" in pythainlp.thai_digits  # check if Thai digit "4" is in the character set

Out[7]:

True

Checking whether a string contains Thai characters, and how many¶

In [8]:

import pythainlp.util

pythainlp.util.isthai("ก")

Out[8]:

True

In [9]:

pythainlp.util.isthai("(ก.พ.)")

Out[9]:

False

In [10]:

pythainlp.util.isthai("(ก.พ.)", ignore_chars=".()")

Out[10]:

True

countthai() returns the proportion of Thai characters in the text, as a percentage. By default, it ignores non-alphabetic characters such as whitespace and digits.

In [11]:

pythainlp.util.countthai("วันอาทิตย์ที่ 24 มีนาคม 2562")

Out[11]:

100.0

You can specify which characters to ignore with the ignore_chars= parameter. Passing an empty string counts every character.

In [12]:

pythainlp.util.countthai("วันอาทิตย์ที่ 24 มีนาคม 2562", ignore_chars="")

Out[12]:

67.85714285714286
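
The two percentages above can be reproduced with standard-library Python. This is a minimal sketch, not PyThaiNLP's implementation: the helper name thai_percentage, the use of the Unicode Thai block (U+0E00 to U+0E7F) as the definition of a Thai character, and the default ignore set are all assumptions made for illustration.

```python
import string

# Assumed default: skip whitespace, ASCII digits, and ASCII punctuation.
DEFAULT_IGNORE = string.whitespace + string.digits + string.punctuation


def is_thai_char(ch: str) -> bool:
    """Return True if ch falls inside the Unicode Thai block."""
    return 0x0E00 <= ord(ch) <= 0x0E7F


def thai_percentage(text: str, ignore_chars: str = DEFAULT_IGNORE) -> float:
    """Percentage of counted characters that are Thai (hypothetical helper)."""
    counted = [ch for ch in text if ch not in ignore_chars]
    if not counted:
        return 0.0
    thai = sum(1 for ch in counted if is_thai_char(ch))
    return thai * 100.0 / len(counted)


text = "วันอาทิตย์ที่ 24 มีนาคม 2562"
print(thai_percentage(text))                   # spaces and digits skipped
print(thai_percentage(text, ignore_chars=""))  # every character counted
```

With nothing ignored, the three spaces and six digits count against the Thai characters, which is why the second figure drops.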

Collation¶

Sorting according to Thai dictionary.

In [13]:

from pythainlp.util import collate

thai_words = ["ค้อน", "กระดาษ", "กรรไกร", "ไข่", "ผ้าไหม"]
collate(thai_words)

Out[13]:

['กรรไกร', 'กระดาษ', 'ไข่', 'ค้อน', 'ผ้าไหม']

In [14]:

collate(thai_words, reverse=True)

Out[14]:

['ผ้าไหม', 'ค้อน', 'ไข่', 'กระดาษ', 'กรรไกร']
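
To see why plain code-point sorting is not enough for Thai, here is a rough sketch of a dictionary-order sort key. It assumes two simplified rules: tone marks are ignored on the first pass, and a leading vowel (such as ไ) is ordered after the consonant that follows it. The name thai_sort_key is hypothetical, and this is not claimed to be how collate() works internally.

```python
import re

# Tone marks and similar above-base signs (U+0E47 to U+0E4C).
_TONE = re.compile("[\u0e47-\u0e4c]")
# A leading vowel (U+0E40 to U+0E44) followed by a consonant (U+0E01 to U+0E2E).
_LEAD_VOWEL_CONSONANT = re.compile("([\u0e40-\u0e44])([\u0e01-\u0e2e])")


def thai_sort_key(word: str) -> tuple:
    """Hypothetical sort key approximating Thai dictionary order."""
    base = _TONE.sub("", word)                       # drop tone marks
    base = _LEAD_VOWEL_CONSONANT.sub(r"\2\1", base)  # put the consonant first
    return (base, word)                              # original word breaks ties


thai_words = ["ค้อน", "กระดาษ", "กรรไกร", "ไข่", "ผ้าไหม"]
print(sorted(thai_words, key=thai_sort_key))
print(sorted(thai_words, key=thai_sort_key, reverse=True))
```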

Date/Time Format and Spellout¶

Date/Time Format¶

Get Thai day and month names with Thai Buddhist Era (B.E.). Use formatting directives similar to datetime.strftime().

In [15]:

import datetime
from pythainlp.util import thai_strftime

fmt = "%Aที่ %-d %B พ.ศ. %Y เวลา %H:%M น. (%a %d-%b-%y)"
date = datetime.datetime(1976, 10, 6, 1, 40)

thai_strftime(date, fmt)

Out[15]:

'วันพุธที่ 6 ตุลาคม พ.ศ. 2519 เวลา 01:40 น. (พ 06-ต.ค.-19)'
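
The year in the output follows from simple arithmetic: for modern dates, the Buddhist Era year is the Common Era year plus 543. A small sketch of that conversion, where buddhist_year is a hypothetical helper and not part of PyThaiNLP:

```python
import datetime

BE_OFFSET = 543  # Buddhist Era year = Common Era year + 543 (modern dates)


def buddhist_year(date: datetime.date) -> int:
    """Return the Buddhist Era year for a date (hypothetical helper)."""
    return date.year + BE_OFFSET


date = datetime.datetime(1976, 10, 6, 1, 40)
full = buddhist_year(date)
print(full)        # four-digit B.E. year, the form %Y produces above
print(full % 100)  # two-digit form, the form %y produces above
```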

From version 2.2, these modifiers can be applied right before the main directive:

- (minus) Do not pad a numeric result string (also available in version 2.1)
_ (underscore) Pad a numeric result string with spaces
0 (zero) Pad a numeric result string with zeros
^ Convert alphabetic characters in result string to upper case
# Swap the case of the result string
O (letter o) Use the locale's alternative numeric symbols (Thai digit)

In [16]:

thai_strftime(date, "%d %b %y")

Out[16]:

'06 ต.ค. 19'

In [17]:

thai_strftime(date, "%d %b %Y")

Out[17]:

'06 ต.ค. 2519'

Time Spellout¶

Note: thai_time() will be renamed to time_to_thaiword() in version 2.2.

In [18]:

from pythainlp.util import thai_time

thai_time("00:14:29")

Out[18]:

'ศูนย์นาฬิกาสิบสี่นาทียี่สิบเก้าวินาที'

The spell-out style can be chosen with the fmt parameter, which can be 24h, 6h, or m6h.

In [19]:

thai_time("00:14:29", fmt="6h")

Out[19]:

'เที่ยงคืนสิบสี่นาทียี่สิบเก้าวินาที'

The precision of the spell-out can be chosen as well, with the precision parameter. It can be m for minute-level, s for second-level, or None to read only the non-zero values.

In [20]:

thai_time("00:14:29", precision="m")

Out[20]:

'ศูนย์นาฬิกาสิบสี่นาที'

In [21]:

print(thai_time("8:17:00", fmt="6h"))
print(thai_time("8:17:00", fmt="m6h", precision="s"))
print(thai_time("18:30:01", fmt="m6h", precision="m"))
print(thai_time("13:30:01", fmt="6h", precision="m"))

สองโมงเช้าสิบเจ็ดนาที
แปดโมงสิบเจ็ดนาทีศูนย์วินาที
หกโมงครึ่ง
บ่ายโมงครึ่ง

We can also pass datetime and time objects to thai_time().

In [22]:

import datetime

time = datetime.time(13, 14, 15)
thai_time(time)

Out[22]:

'สิบสามนาฬิกาสิบสี่นาทีสิบห้าวินาที'

In [23]:

time = datetime.datetime(10, 11, 12, 13, 14, 15)
thai_time(time, fmt="6h", precision="m")

Out[23]:

'บ่ายโมงสิบสี่นาที'

Tokenization and Segmentation¶

At sentence, word, and sub-word levels.

Sentence¶

The default sentence tokenizer is "crfcut". The tokenization engine can be chosen using the engine= parameter.

In [24]:

from pythainlp import sent_tokenize

text = ("พระราชบัญญัติธรรมนูญการปกครองแผ่นดินสยามชั่วคราว พุทธศักราช ๒๔๗๕ "
        "เป็นรัฐธรรมนูญฉบับชั่วคราว ซึ่งถือว่าเป็นรัฐธรรมนูญฉบับแรกแห่งราชอาณาจักรสยาม "
        "ประกาศใช้เมื่อวันที่ 27 มิถุนายน พ.ศ. 2475 "
        "โดยเป็นผลพวงหลังการปฏิวัติเมื่อวันที่ 24 มิถุนายน พ.ศ. 2475 โดยคณะราษฎร")

print("default (crfcut):")
print(sent_tokenize(text))
print("\nwhitespace+newline:")
print(sent_tokenize(text, engine="whitespace+newline"))

default (crfcut):
['พระราชบัญญัติธรรมนูญการปกครองแผ่นดินสยามชั่วคราว พุทธศักราช ๒๔๗๕ เป็นรัฐธรรมนูญฉบับชั่วคราว ', 'ซึ่งถือว่าเป็นรัฐธรรมนูญฉบับแรกแห่งราชอาณาจักรสยาม ', 'ประกาศใช้เมื่อวันที่ 27 มิถุนายน พ.ศ. 2475 ', 'โดยเป็นผลพวงหลังการปฏิวัติเมื่อวันที่ 24 มิถุนายน พ.ศ. 2475 โดยคณะราษฎร']

whitespace+newline:
['พระราชบัญญัติธรรมนูญการปกครองแผ่นดินสยามชั่วคราว', 'พุทธศักราช', '๒๔๗๕', 'เป็นรัฐธรรมนูญฉบับชั่วคราว', 'ซึ่งถือว่าเป็นรัฐธรรมนูญฉบับแรกแห่งราชอาณาจักรสยาม', 'ประกาศใช้เมื่อวันที่', '27', 'มิถุนายน', 'พ.ศ.', '2475', 'โดยเป็นผลพวงหลังการปฏิวัติเมื่อวันที่', '24', 'มิถุนายน', 'พ.ศ.', '2475', 'โดยคณะราษฎร']

Word¶

The default word tokenizer ("newmm") uses a maximum matching algorithm.

In [25]:

from pythainlp import word_tokenize

text = "ก็จะรู้ความชั่วร้ายที่ทำไว้     และคงจะไม่ยอมให้ทำนาบนหลังคน "

print("default (newmm):")
print(word_tokenize(text))
print("\nnewmm and keep_whitespace=False:")
print(word_tokenize(text, keep_whitespace=False))

default (newmm):
['ก็', 'จะ', 'รู้ความ', 'ชั่วร้าย', 'ที่', 'ทำ', 'ไว้', '     ', 'และ', 'คงจะ', 'ไม่', 'ยอมให้', 'ทำนาบนหลังคน', ' ']

newmm and keep_whitespace=False:
['ก็', 'จะ', 'รู้ความ', 'ชั่วร้าย', 'ที่', 'ทำ', 'ไว้', 'และ', 'คงจะ', 'ไม่', 'ยอมให้', 'ทำนาบนหลังคน']
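
To give a feel for dictionary-based maximum matching, here is a toy sketch that splits a string into the fewest possible tokens, preferring dictionary words. It is not the newmm implementation; the function name max_match, the tiny dictionary, and the rule that uncovered characters become one-character tokens are assumptions made for this example.

```python
def max_match(text: str, dictionary: set) -> list:
    """Split text into the fewest tokens, preferring dictionary words.

    Characters not covered by any dictionary word become one-character tokens.
    """
    n = len(text)
    longest = max((len(w) for w in dictionary), default=1)
    # best[i] holds (token_count, tokens) for the prefix text[:i].
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(n):
        count, tokens = best[i]
        candidates = [text[i]]  # fallback: a single character
        for j in range(i + 2, min(n, i + longest) + 1):
            if text[i:j] in dictionary:
                candidates.append(text[i:j])
        for word in candidates:
            j = i + len(word)
            # Keep the segmentation that reaches position j with fewer tokens.
            if best[j] is None or count + 1 < best[j][0]:
                best[j] = (count + 1, tokens + [word])
    return best[n][1]


toy_dictionary = {"กฎหมาย", "แรงงาน", "แรง", "งาน"}
result = max_match("กฎหมายแรงงาน", toy_dictionary)
print(result)
```

The dynamic-programming table is what separates this from a greedy longest-first scan: a shorter first word is still chosen when it leads to fewer tokens overall.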

Other algorithms can be chosen. We can also create a tokenizer with a custom dictionary.

In [3]:

from pythainlp import word_tokenize, Tokenizer

text = "กฎหมายแรงงานฉบับปรับปรุงใหม่ประกาศใช้แล้ว"

print("newmm  :", word_tokenize(text))  # default engine is "newmm"
print("longest:", word_tokenize(text, engine="longest"))

words = ["แรงงาน"]
custom_tokenizer = Tokenizer(words)
print("newmm (custom dictionary):", custom_tokenizer.word_tokenize(text))

newmm  : ['กฎหมายแรงงาน', 'ฉบับ', 'ปรับปรุง', 'ใหม่', 'ประกาศ', 'ใช้แล้ว']
longest: ['กฎหมายแรงงาน', 'ฉบับ', 'ปรับปรุง', 'ใหม่', 'ประกาศใช้', 'แล้ว']
newmm (custom dictionary): ['กฎหมาย', 'แรงงาน', 'ฉบับปรับปรุงใหม่ประกาศใช้แล้ว']

The default word tokenizer uses a word list from pythainlp.corpus.common.thai_words(). We can get that list, add or remove words, and create a new tokenizer from the modified list.

In [4]:

from pythainlp.corpus.common import thai_words
from pythainlp import Tokenizer

text = "นิยายวิทยาศาสตร์ของไอแซค อสิมอฟ"

print("default dictionary:", word_tokenize(text))

words = set(thai_words())  # thai_words() returns frozenset
words.add("ไอแซค")  # Isaac
words.add("อสิมอฟ")  # Asimov
custom_tokenizer = Tokenizer(words)
print("custom dictionary :", custom_tokenizer.word_tokenize(text))

default dictionary: ['นิยาย', 'วิทยาศาสตร์', 'ของ', 'ไอแซค', ' ', 'อสิ', 'มอ', 'ฟ']
custom dictionary : ['นิยาย', 'วิทยาศาสตร์', 'ของ', 'ไอแซค', ' ', 'อสิมอฟ']

Alternatively, we can create a dictionary trie with pythainlp.util.Trie() and pass it to the default tokenizer.

In [5]:

from pythainlp.corpus.common import thai_words
from pythainlp.util import Trie

text = "ILO87 ว่าด้วยเสรีภาพในการสมาคมและการคุ้มครองสิทธิในการรวมตัว ILO98 ว่าด้วยสิทธิในการรวมตัวและการร่วมเจรจาต่อรอง"

print("default dictionary:", word_tokenize(text))

new_words = {"ILO87", "ILO98", "การร่วมเจรจาต่อรอง", "สิทธิในการรวมตัว", "เสรีภาพในการสมาคม", "แรงงานสัมพันธ์"}
words = new_words.union(thai_words())

custom_dictionary_trie = Trie(words)
print("custom dictionary :", word_tokenize(text, custom_dict=custom_dictionary_trie))

default dictionary: ['ILO', '87', ' ', 'ว่าด้วย', 'เสรีภาพ', 'ใน', 'การสมาคม', 'และ', 'การ', 'คุ้มครอง', 'สิทธิ', 'ใน', 'การ', 'รวมตัว', ' ', 'ILO', '98', ' ', 'ว่าด้วย', 'สิทธิ', 'ใน', 'การ', 'รวมตัว', 'และ', 'การ', 'ร่วม', 'เจรจา', 'ต่อรอง']
custom dictionary : ['ILO87', ' ', 'ว่าด้วย', 'เสรีภาพในการสมาคม', 'และ', 'การ', 'คุ้มครอง', 'สิทธิในการรวมตัว', ' ', 'ILO98', ' ', 'ว่าด้วย', 'สิทธิในการรวมตัว', 'และ', 'การร่วมเจรจาต่อรอง']
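
A trie suits dictionary-based tokenization because it can list every dictionary word that starts at a given position in a single pass over the text. The class below is a minimal illustration of that idea. SimpleTrie and its prefixes() method are hypothetical names for this sketch and do not describe the interface of pythainlp.util.Trie.

```python
class SimpleTrie:
    """Minimal character trie (illustrative; not pythainlp.util.Trie)."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True  # the empty-string key marks the end of a word

    def __contains__(self, word: str) -> bool:
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "" in node

    def prefixes(self, text: str) -> list:
        """Return every stored word that is a prefix of text."""
        found, node = [], self.root
        for i, ch in enumerate(text):
            if ch not in node:
                break
            node = node[ch]
            if "" in node:
                found.append(text[: i + 1])
        return found


trie = SimpleTrie({"ILO", "ILO87", "ILO98"})
print(trie.prefixes("ILO87 ว่าด้วย"))
print("ILO98" in trie, "IL" in trie)
```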

Testing different tokenization engines

In [29]:

speedtest_text = """
ครบรอบ 14 ปี ตากใบ เช้าวันนั้น 25 ต.ค. 2547 ผู้ชุมนุมชายกว่า 1,370 คน
ถูกโยนขึ้นรถยีเอ็มซี 22 หรือ 24 คัน นอนซ้อนกันคันละ 4-5 ชั้น เดินทางจากสถานีตำรวจตากใบ ไปไกล 150 กิโลเมตร
ไปถึงค่ายอิงคยุทธบริหาร ใช้เวลากว่า 6 ชั่วโมง / ในอีกคดีที่ญาติฟ้องร้องรัฐ คดีจบลงที่การประนีประนอมยอมความ
กระทรวงกลาโหมจ่ายค่าสินไหมทดแทนรวม 42 ล้านบาทให้กับญาติผู้เสียหาย 79 ราย
ปิดหีบและนับคะแนนเสร็จแล้ว ที่หน่วยเลือกตั้งที่ 32 เขต 13 แขวงหัวหมาก เขตบางกะปิ กรุงเทพมหานคร
ผู้สมัคร ส.ส. และตัวแทนพรรคการเมืองจากหลายพรรคต่างมาเฝ้าสังเกตการนับคะแนนอย่างใกล้ชิด โดย
ฐิติภัสร์ โชติเดชาชัยนันต์ จากพรรคพลังประชารัฐ และพริษฐ์ วัชรสินธุ จากพรรคประชาธิปัตย์ได้คะแนน
96 คะแนนเท่ากัน
เช้าวันอาทิตย์ที่ 21 เมษายน 2019 ซึ่งเป็นวันอีสเตอร์ วันสำคัญของชาวคริสต์
เกิดเหตุระเบิดต่อเนื่องในโบสถ์คริสต์และโรงแรมอย่างน้อย 7 แห่งในประเทศศรีลังกา
มีผู้เสียชีวิตแล้วอย่างน้อย 156 คน และบาดเจ็บหลายร้อยคน ยังไม่มีข้อมูลว่าผู้ก่อเหตุมาจากฝ่ายใด
จีนกำหนดจัดการประชุมข้อริเริ่มสายแถบและเส้นทางในช่วงปลายสัปดาห์นี้ ปักกิ่งยืนยันว่า
อภิมหาโครงการเชื่อมโลกของจีนไม่ใช่เครื่องมือแผ่อิทธิพล แต่ยินดีรับฟังข้อวิจารณ์ เช่น ประเด็นกับดักหนี้สิน
และความไม่โปร่งใส รัฐบาลปักกิ่งบอกว่า เวทีประชุม Belt and Road Forum ในช่วงวันที่ 25-27 เมษายน
ถือเป็นงานการทูตที่สำคัญที่สุดของจีนในปี 2019
"""

In [30]:

# Speed test: Calling "longest" engine through word_tokenize wrapper
%time tokens = word_tokenize(speedtest_text, engine="longest")

CPU times: user 253 ms, sys: 2.27 ms, total: 256 ms
Wall time: 255 ms

In [31]:

# Speed test: Calling "newmm" engine through word_tokenize wrapper
%time tokens = word_tokenize(speedtest_text, engine="newmm")

CPU times: user 3.4 ms, sys: 60 µs, total: 3.46 ms
Wall time: 3.47 ms

In [32]:

# Speed test: Calling "newmm-safe" engine through word_tokenize wrapper
%time tokens = word_tokenize(speedtest_text, engine="newmm-safe")

CPU times: user 4.08 ms, sys: 88 µs, total: 4.16 ms
Wall time: 4.15 ms

In [33]:

#!pip install attacut
# Speed test: Calling "attacut" engine through word_tokenize wrapper
%time tokens = word_tokenize(speedtest_text, engine="attacut")

CPU times: user 833 ms, sys: 174 ms, total: 1.01 s
Wall time: 576 ms

Get all possible segmentations

In [34]:

from pythainlp.tokenize.multi_cut import find_all_segment, mmcut, segment

find_all_segment("มีความเป็นไปได้อย่างไรบ้าง")

Out[34]:

['มี|ความ|เป็น|ไป|ได้|อย่าง|ไร|บ้าง|',
 'มี|ความ|เป็นไป|ได้|อย่าง|ไร|บ้าง|',
 'มี|ความ|เป็นไปได้|อย่าง|ไร|บ้าง|',
 'มี|ความเป็นไป|ได้|อย่าง|ไร|บ้าง|',
 'มี|ความเป็นไปได้|อย่าง|ไร|บ้าง|',
 'มี|ความ|เป็น|ไป|ได้|อย่างไร|บ้าง|',
 'มี|ความ|เป็นไป|ได้|อย่างไร|บ้าง|',
 'มี|ความ|เป็นไปได้|อย่างไร|บ้าง|',
 'มี|ความเป็นไป|ได้|อย่างไร|บ้าง|',
 'มี|ความเป็นไปได้|อย่างไร|บ้าง|',
 'มี|ความ|เป็น|ไป|ได้|อย่างไรบ้าง|',
 'มี|ความ|เป็นไป|ได้|อย่างไรบ้าง|',
 'มี|ความ|เป็นไปได้|อย่างไรบ้าง|',
 'มี|ความเป็นไป|ได้|อย่างไรบ้าง|',
 'มี|ความเป็นไปได้|อย่างไรบ้าง|']
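
The enumeration above can be imitated with a short recursive function over a small dictionary. This is only a sketch of the idea: all_segmentations is a hypothetical helper, the word list is hand-picked from the output above, and nothing here is taken from the multi_cut source.

```python
def all_segmentations(text: str, dictionary: set) -> list:
    """Return every way to split text into dictionary words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Combine this first word with every split of the remainder.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


toy_dictionary = {
    "มี", "ความ", "เป็น", "ไป", "ได้", "เป็นไป", "เป็นไปได้",
    "ความเป็นไป", "ความเป็นไปได้", "อย่าง", "ไร", "บ้าง",
    "อย่างไร", "อย่างไรบ้าง",
}
sentence = "มีความเป็นไปได้อย่างไรบ้าง"
segmentations = all_segmentations(sentence, toy_dictionary)
for tokens in segmentations:
    print("|".join(tokens))
print(len(segmentations))
```

The number of segmentations grows quickly with text length, so this brute-force form is only practical for short strings.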

Subword, syllable, and Thai Character Cluster (TCC)¶

Tokenization can also be done at subword level, either syllable or Thai Character Cluster (TCC).

Syllable segmentation uses ssg, a CRF syllable segmenter for Thai by Ponrawee Prasertsom.
A TCC is smaller than a syllable. For information about TCC, see Character Cluster Based Thai Information Retrieval (Theeramunkong et al. 2004).

Subword tokenization¶

The default subword tokenization engine is tcc, which uses the Thai Character Cluster (TCC) as the subword unit.

In [35]:

from pythainlp import subword_tokenize

subword_tokenize("ประเทศไทย")  # default subword unit is TCC

Out[35]:

['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']

Syllable tokenization¶

The default syllable tokenization engine is dict, which uses the newmm word tokenization engine with a custom dictionary containing known syllables in the Thai language.

In [36]:

from pythainlp.tokenize import syllable_tokenize

text = "อับดุลเลาะ อีซอมูซอ สมองบวมรุนแรง"

syllable_tokenize(text)  # default engine is "dict"

Out[36]:

['อับ',
 'ดุล',
 'เลาะ',
 ' ',
 'อี',
 'ซอ',
 'มู',
 'ซอ',
 ' ',
 'สมอง',
 'บวม',
 'รุน',
 'แรง']

The external ssg engine can also be called. Note that the ssg engine does not emit whitespace as separate tokens; in the output below, spaces stay attached to adjacent syllables.

In [37]:

syllable_tokenize(text, engine="ssg")  # use "ssg" for syllable

Out[37]:

['อับ', 'ดุล', 'เลาะ', ' อี', 'ซอ', 'มู', 'ซอ ', 'สมอง', 'บวม', 'รุน', 'แรง']

Low-level subword operations¶

These low-level TCC operations can be useful for some pre-processing tasks, such as checking whether it is safe to cut a string at a certain position, or finding typos.

In [38]:

from pythainlp.tokenize import tcc

tcc.segment("ประเทศไทย")

Out[38]:

['ป', 'ระ', 'เท', 'ศ', 'ไท', 'ย']

In [39]:

tcc.tcc_pos("ประเทศไทย")  # return positions

Out[39]:

{1, 3, 5, 6, 8, 9}

In [40]:

for ch in tcc.tcc("ประเทศไทย"):  # TCC generator
    print(ch, end='-')

ป-ระ-เท-ศ-ไท-ย-
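
The positions returned by tcc_pos() above line up with the running total of the cluster lengths returned by segment(). The sketch below recomputes them from the cluster list and uses them for the "safe to cut here?" check. The helper is_safe_cut is a hypothetical name for illustration.

```python
from itertools import accumulate

clusters = ["ป", "ระ", "เท", "ศ", "ไท", "ย"]  # tcc.segment() output shown above

# The running total of cluster lengths gives the end offset of each cluster.
boundaries = set(accumulate(len(c) for c in clusters))
print(sorted(boundaries))


def is_safe_cut(position: int) -> bool:
    """Hypothetical helper: True if cutting at position keeps clusters whole."""
    return position in boundaries


print(is_safe_cut(3))  # between "ระ" and "เท"
print(is_safe_cut(4))  # inside "เท"
```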

Transliteration¶

There are two types of transliteration here: romanization and transliteration.

Romanization will render Thai words in the Latin alphabet using the Royal Thai General System of Transcription (RTGS).
- Two engines are supported here: a simple royin engine (default) and a more accurate thai2rom engine.
Transliteration here, in PyThaiNLP context, means the sound representation of a string.
- Two engines are supported here: ipa (International Phonetic Alphabet system, using Epitran) (default) and icu (International Components for Unicode, using PyICU).

In [41]:

from pythainlp.transliterate import romanize

romanize("แมว")  # output: 'maeo'

Out[41]:

'maeo'

In [42]:

romanize("ภาพยนตร์")  # output: 'phapn' (*obviously wrong)

Out[42]:

'phapn'

In [43]:

from pythainlp.transliterate import transliterate

transliterate("แมว")  # phonetic transcription; the exact form depends on the engine

Update Corpus...
Corpus: thai-g2p
- Already up to date.

Out[43]:

'm ɛː w ˧'

In [44]:

transliterate("ภาพยนตร์")  # phonetic transcription; the exact form depends on the engine

Out[44]:

'pʰ aː p̚ ˥˩ . pʰ a ˦˥ . j o n ˧'

Normalization¶

normalize() removes zero-width spaces (ZWSP and ZWNJ), duplicated spaces, repeating vowels, and dangling characters. It also reorders vowels and tone marks while removing repeating vowels.

In [45]:

from pythainlp.util import normalize

normalize("เเปลก") == "แปลก"  # เ เ ป ล ก  vs แ ป ล ก

Out[45]:

True

The string below contains a non-standard order of Thai characters, Sara Aa (following vowel) + Mai Ek (upper tone mark). normalize() will reorder it to Mai Ek + Sara Aa.

In [46]:

text = "เกา่"
normalize(text)

Out[46]:

'เก่า'

This can be useful for string matching, including tokenization.

In [47]:

from pythainlp import word_tokenize

text = "เก็บวันน้ี พรุ่งน้ีก็เกา่"

print("tokenize immediately:")
print(word_tokenize(text))
print("\nnormalize, then tokenize:")
print(word_tokenize(normalize(text)))

tokenize immediately:
['เก็บ', 'วัน', 'น้ี', ' ', 'พรุ่งน้ี', 'ก็', 'เกา', '่']

normalize, then tokenize:
['เก็บ', 'วันนี้', ' ', 'พรุ่งนี้', 'ก็', 'เก่า']

The string below contains repeating vowels (multiple Sara A in a row). normalize() will keep only one of them. This can be used to reduce spelling variations, which is useful for classification tasks.

In [48]:

normalize("เกะะะ")

Out[48]:

'เกะ'

Internally, normalize() is just a series of function calls like this:

text = remove_zw(text)
text = remove_dup_spaces(text)
text = remove_repeat_vowels(text)
text = remove_dangling(text)

If you don't like the default behavior of normalize(), you can call the functions shown above, as well as remove_tonemark() and reorder_vowels(), individually from pythainlp.util to build your own normalization.
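
To show what such a chain of small steps can look like, here is a standard-library sketch that covers three of the cases from this section: zero-width characters, duplicated spaces, and the two vowel fixes shown earlier. The regular expressions and the name simple_normalize are assumptions for illustration. They are rough stand-ins, not the pythainlp.util functions, and they cover far fewer cases.

```python
import re

ZERO_WIDTH = re.compile("[\u200b\u200c]")  # ZWSP and ZWNJ
DUP_SPACES = re.compile(" {2,}")
# A vowel sign in U+0E30 to U+0E39 repeated two or more times.
REPEAT_VOWEL = re.compile("([\u0e30-\u0e39])\\1+")
# Sara Aa (U+0E32) placed before a tone mark (U+0E48 to U+0E4B).
AA_BEFORE_TONE = re.compile("(\u0e32)([\u0e48-\u0e4b])")


def simple_normalize(text: str) -> str:
    """Apply each clean-up step in turn (simplified stand-ins)."""
    text = ZERO_WIDTH.sub("", text)
    text = DUP_SPACES.sub(" ", text)
    text = AA_BEFORE_TONE.sub(r"\2\1", text)  # tone mark first, then Sara Aa
    text = REPEAT_VOWEL.sub(r"\1", text)      # keep one copy of the vowel
    return text


# Sara E, Ko Kai, Sara Aa, Mai Ek: the non-standard order from the example above.
misordered = "\u0e40\u0e01\u0e32\u0e48"
print(simple_normalize(misordered))
print(simple_normalize("เกะะะ"))
print(simple_normalize("a\u200b  b"))
```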

Digit conversion¶

Thai text sometimes uses Thai digits. This can reduce performance for classification and searching. PyThaiNLP provides a few utility functions to deal with this.

In [49]:

from pythainlp.util import arabic_digit_to_thai_digit, thai_digit_to_arabic_digit, digit_to_text

text = "ฉุกเฉินที่ยุโรปเรียก 112 ๑๑๒"

arabic_digit_to_thai_digit(text)

Out[49]:

'ฉุกเฉินที่ยุโรปเรียก ๑๑๒ ๑๑๒'

In [50]:

thai_digit_to_arabic_digit(text)

Out[50]:

'ฉุกเฉินที่ยุโรปเรียก 112 112'

In [51]:

digit_to_text(text)

Out[51]:

'ฉุกเฉินที่ยุโรปเรียก หนึ่งหนึ่งสอง หนึ่งหนึ่งสอง'
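
The two digit conversions are character-for-character mappings, so they can be sketched with str.maketrans() and str.translate() from the standard library. The function names below are hypothetical, and this is not claimed to be how the PyThaiNLP functions are implemented.

```python
# Thai digits zero to nine occupy U+0E50 to U+0E59.
THAI_DIGITS = "".join(chr(0x0E50 + i) for i in range(10))
ARABIC_DIGITS = "0123456789"

_TO_ARABIC = str.maketrans(THAI_DIGITS, ARABIC_DIGITS)
_TO_THAI = str.maketrans(ARABIC_DIGITS, THAI_DIGITS)


def to_arabic_digits(text: str) -> str:
    """Replace Thai digits with Arabic digits; leave other characters alone."""
    return text.translate(_TO_ARABIC)


def to_thai_digits(text: str) -> str:
    """Replace Arabic digits with Thai digits; leave other characters alone."""
    return text.translate(_TO_THAI)


text = "112 ๑๑๒"
print(to_arabic_digits(text))
print(to_thai_digits(text))
```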

Soundex¶

"Soundex is a phonetic algorithm for indexing names by sound." (Wikipedia). PyThaiNLP provides three kinds of Thai soundex.

In [52]:

from pythainlp.soundex import lk82, metasound, udom83

# check equivalence
print(lk82("รถ") == lk82("รด"))
print(udom83("วรร") == udom83("วัน"))
print(metasound("นพ") == metasound("นภ"))

True
True
True

In [53]:

texts = ["บูรณะ", "บูรณการ", "มัก", "มัค", "มรรค", "ลัก", "รัก", "รักษ์", ""]
for text in texts:
    print(
        "{} - lk82: {} - udom83: {} - metasound: {}".format(
            text, lk82(text), udom83(text), metasound(text)
        )
    )

บูรณะ - lk82: บE400 - udom83: บ930000 - metasound: บ550
บูรณการ - lk82: บE419 - udom83: บ931900 - metasound: บ551
มัก - lk82: ม1000 - udom83: ม100000 - metasound: ม100
มัค - lk82: ม1000 - udom83: ม100000 - metasound: ม100
มรรค - lk82: ม1000 - udom83: ม310000 - metasound: ม551
ลัก - lk82: ร1000 - udom83: ร100000 - metasound: ล100
รัก - lk82: ร1000 - udom83: ร100000 - metasound: ร100
รักษ์ - lk82: ร1000 - udom83: ร100000 - metasound: ร100
 - lk82:  - udom83:  - metasound:

Spellchecking¶

The default spellchecker uses Peter Norvig's algorithm together with word frequencies from the Thai National Corpus (TNC).

spell() returns a list of all possible spellings.

In [54]:

from pythainlp import spell

spell("เหลืยม")

Out[54]:

['เหลียม', 'เหลือม']

correct() returns the most likely spelling.

In [55]:

from pythainlp import correct

correct("เหลืยม")

Out[55]:

'เหลียม'

Spellchecking - Custom dictionary and word frequency¶

A custom dictionary can be provided when creating a spellchecker. When creating a NorvigSpellChecker object, pass the custom dictionary to the custom_dict parameter.

custom_dict can be:

a dictionary (dict), with words (str) as keys and frequencies (int) as values; or
a list, a tuple, or a set of (word, frequency) tuples; or
a list, a tuple, or a set of just words, without their frequencies; in this case, a frequency of 1 is assigned to every word.

In [56]:

from pythainlp.spell import NorvigSpellChecker

user_dict = [("เหลียม", 50), ("เหลือม", 1000), ("เหลียว", 1000000)]
checker = NorvigSpellChecker(custom_dict=user_dict)

checker.spell("เหลืยม")

Out[56]:

['เหลือม', 'เหลียม']

As you can see, our version of NorvigSpellChecker gives edit distance priority over word frequency: เหลียว has by far the highest frequency, but it is two edits away from the input, so it is not suggested.
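
The ranking can be illustrated with a compact version of Norvig's approach: generate every string one edit away, keep the ones found in the dictionary, fall back to two edits only if none are found, and sort the survivors by frequency. This is a sketch under those assumptions, with hypothetical function names edits1 and spell. It is not the NorvigSpellChecker source.

```python
def edits1(word: str, letters: set) -> set:
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def spell(word: str, freqs: dict) -> list:
    """Known words at the smallest edit distance, most frequent first."""
    if word in freqs:
        return [word]
    letters = set("".join(freqs))  # alphabet taken from the dictionary itself
    one_edit = edits1(word, letters)
    candidates = {w for w in one_edit if w in freqs}
    if not candidates:  # only look two edits away when nothing is closer
        candidates = {
            w for e in one_edit for w in edits1(e, letters) if w in freqs
        }
    if not candidates:
        return [word]
    return sorted(candidates, key=freqs.get, reverse=True)


freqs = dict([("เหลียม", 50), ("เหลือม", 1000), ("เหลียว", 1000000)])
suggestions = spell("เหลืยม", freqs)
print(suggestions)
```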

You can also use word frequencies from the Thai National Corpus and the Thai Textbook Corpus.

By default, NorvigSpellChecker uses the Thai National Corpus.

In [57]:

from pythainlp.corpus import ttc  # Thai Textbook Corpus

checker = NorvigSpellChecker(custom_dict=ttc.word_freqs())

checker.spell("เหลืยม")

Out[57]:

['เหลือม']

In [58]:

checker.correct("เหลืยม")

Out[58]:

'เหลือม'

To check the current dictionary of a spellchecker:

In [59]:

list(checker.dictionary())[1:10]

Out[59]:

[('พิธีเปิด', 18),
 ('ไส้กรอก', 40),
 ('ปลิง', 6),
 ('เต็ง', 13),
 ('ขอบคุณ', 356),
 ('ประสาน', 84),
 ('รำไร', 11),
 ('ร่วมท้อง', 4),
 ('ฝักมะขาม', 3)]

We can also apply conditions and a filter function to the dictionary when creating a spellchecker.

In [60]:

checker = NorvigSpellChecker()  # use default filter (remove any word with number or non-Thai character)
len(checker.dictionary())

Out[60]:

In [61]:

checker = NorvigSpellChecker(min_freq=5, min_len=2, max_len=15)
len(checker.dictionary())

Out[61]:

In [62]:

checker_no_filter = NorvigSpellChecker(dict_filter=None)  # use no filter
len(checker_no_filter.dictionary())

Out[62]:

In [63]:

def remove_yamok(word):
    # keep only words that do not contain the repetition mark (Mai Yamok)
    return "ๆ" not in word

checker_custom_filter = NorvigSpellChecker(dict_filter=remove_yamok)  # use custom filter
len(checker_custom_filter.dictionary())

Out[63]:
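
The effect of these options can be pictured as a plain filter over (word, frequency) pairs. The sketch below is an illustration only: filter_dictionary and its default values are hypothetical, the first four pairs are taken from the dictionary listing shown earlier, and the last pair is made up so that the custom filter has something to remove.

```python
def filter_dictionary(word_freqs, min_freq=1, min_len=1, max_len=40,
                      dict_filter=None):
    """Keep pairs that pass the frequency, length, and filter conditions.

    The default values are arbitrary choices for this sketch.
    """
    kept = []
    for word, freq in word_freqs:
        if freq < min_freq or not (min_len <= len(word) <= max_len):
            continue
        if dict_filter is not None and not dict_filter(word):
            continue
        kept.append((word, freq))
    return kept


def no_yamok(word: str) -> bool:
    """Filter function: reject words containing the repetition mark."""
    return "ๆ" not in word


sample = [
    ("พิธีเปิด", 18), ("ปลิง", 6), ("ร่วมท้อง", 4), ("ฝักมะขาม", 3),
    ("ต่างๆ", 100),  # made-up entry, included only to exercise the filter
]
filtered = filter_dictionary(sample, min_freq=5, dict_filter=no_yamok)
print(filtered)
```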

Part-of-Speech Tagging¶

In [64]:

from pythainlp.tag import pos_tag, pos_tag_sents

pos_tag(["การ","เดินทาง"])

Out[64]:

[('การ', 'FIXN'), ('เดินทาง', 'VACT')]

In [65]:

sents = [["ประกาศสำนักนายกฯ", " ", "ให้",
    " ", "'พล.ท.สรรเสริญ แก้วกำเนิด'", " ", "พ้นจากตำแหน่ง",
    " ", "ผู้ทรงคุณวุฒิพิเศษ", "กองทัพบก", " ", "กระทรวงกลาโหม"],
    ["และ", "แต่งตั้ง", "ให้", "เป็น", "'อธิบดีกรมประชาสัมพันธ์'"]]

pos_tag_sents(sents)

Out[65]:

[[('ประกาศสำนักนายกฯ', 'NCMN'),
  (' ', 'PUNC'),
  ('ให้', 'JSBR'),
  (' ', 'PUNC'),
  ("'พล.ท.สรรเสริญ แก้วกำเนิด'", 'NCMN'),
  (' ', 'PUNC'),
  ('พ้นจากตำแหน่ง', 'NCMN'),
  (' ', 'PUNC'),
  ('ผู้ทรงคุณวุฒิพิเศษ', 'NCMN'),
  ('กองทัพบก', 'NCMN'),
  (' ', 'PUNC'),
  ('กระทรวงกลาโหม', 'NCMN')],
 [('และ', 'JCRG'),
  ('แต่งตั้ง', 'VACT'),
  ('ให้', 'JSBR'),
  ('เป็น', 'VSTA'),
  ("'อธิบดีกรมประชาสัมพันธ์'", 'NCMN')]]

Named-Entity Tagging¶

The tagger uses the BIO scheme:

B - beginning of entity
I - inside entity
O - outside entity

In [66]:

#!pip3 install pythainlp[ner]
from pythainlp.tag.named_entity import ThaiNameTagger

ner = ThaiNameTagger()
ner.get_ner("24 มิ.ย. 2563 ทดสอบระบบเวลา 6:00 น. เดินทางจากขนส่งกรุงเทพใกล้ถนนกำแพงเพชร ไปจังหวัดกำแพงเพชร ตั๋วราคา 297 บาท")

Out[66]:

[('24', 'NUM', 'B-DATE'),
 (' ', 'PUNCT', 'I-DATE'),
 ('มิ.ย.', 'NOUN', 'I-DATE'),
 (' ', 'PUNCT', 'O'),
 ('2563', 'NUM', 'O'),
 (' ', 'PUNCT', 'O'),
 ('ทดสอบ', 'VERB', 'O'),
 ('ระบบ', 'NOUN', 'O'),
 ('เวลา', 'NOUN', 'O'),
 (' ', 'PUNCT', 'O'),
 ('6', 'NUM', 'B-TIME'),
 (':', 'PUNCT', 'I-TIME'),
 ('00', 'NUM', 'I-TIME'),
 (' ', 'PUNCT', 'I-TIME'),
 ('น.', 'NOUN', 'I-TIME'),
 (' ', 'PUNCT', 'O'),
 ('เดินทาง', 'VERB', 'O'),
 ('จาก', 'ADP', 'O'),
 ('ขนส่ง', 'NOUN', 'B-ORGANIZATION'),
 ('กรุงเทพ', 'NOUN', 'I-ORGANIZATION'),
 ('ใกล้', 'ADJ', 'O'),
 ('ถนน', 'NOUN', 'B-LOCATION'),
 ('กำแพงเพชร', 'NOUN', 'I-LOCATION'),
 (' ', 'PUNCT', 'O'),
 ('ไป', 'AUX', 'O'),
 ('จังหวัด', 'VERB', 'B-LOCATION'),
 ('กำแพงเพชร', 'NOUN', 'I-LOCATION'),
 (' ', 'PUNCT', 'O'),
 ('ตั๋ว', 'NOUN', 'O'),
 ('ราคา', 'NOUN', 'O'),
 (' ', 'PUNCT', 'O'),
 ('297', 'NUM', 'B-MONEY'),
 (' ', 'PUNCT', 'I-MONEY'),
 ('บาท', 'NOUN', 'I-MONEY')]
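
BIO tags mark each token separately, so a common follow-up step is to merge them back into whole entities. A minimal sketch of that grouping is shown below. The function group_entities is a hypothetical helper, and the sample list is an excerpt of triples copied from the output above, not a complete sentence.

```python
def group_entities(tagged):
    """Merge (word, pos, bio_tag) triples into (entity_type, text) pairs."""
    entities, current_type, current_words = [], None, []
    for word, _pos, tag in tagged:
        if tag.startswith("B-"):
            if current_words:  # close the entity that was open
                entities.append((current_type, "".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(word)
        else:  # "O", or an "I-" tag that does not continue the open entity
            if current_words:
                entities.append((current_type, "".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, "".join(current_words)))
    return entities


excerpt = [
    ("24", "NUM", "B-DATE"), (" ", "PUNCT", "I-DATE"), ("มิ.ย.", "NOUN", "I-DATE"),
    (" ", "PUNCT", "O"), ("2563", "NUM", "O"),
    ("ถนน", "NOUN", "B-LOCATION"), ("กำแพงเพชร", "NOUN", "I-LOCATION"),
]
entities = group_entities(excerpt)
print(entities)
```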

Word Vector¶

In [67]:

import pythainlp.word_vector

pythainlp.word_vector.similarity("คน", "มนุษย์")

Out[67]:

0.2504981

In [68]:

pythainlp.word_vector.doesnt_match(["คน", "มนุษย์", "บุคคล", "เจ้าหน้าที่", "ไก่"])

Out[68]:

'ไก่'
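
Word-vector similarity scores like the one above are commonly computed as the cosine of the angle between two vectors, and the odd word out can be picked as the one least similar to the average of the group. The sketch below assumes that interpretation and uses tiny made-up vectors with placeholder keys. It does not load the PyThaiNLP model, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out(vectors: dict) -> str:
    """Return the key whose vector is least similar to the group mean."""
    unit = {}
    for key, vec in vectors.items():
        norm = math.sqrt(sum(a * a for a in vec))
        unit[key] = [a / norm for a in vec]
    dims = len(next(iter(unit.values())))
    mean = [sum(vec[d] for vec in unit.values()) / len(unit) for d in range(dims)]
    return min(unit, key=lambda key: cosine_similarity(unit[key], mean))


# Made-up two-dimensional vectors, for illustration only.
toy_vectors = {"a": [1.0, 0.1], "b": [1.0, 0.2], "c": [0.9, 0.1], "d": [0.0, 1.0]}
print(cosine_similarity(toy_vectors["a"], toy_vectors["b"]))
print(odd_one_out(toy_vectors))
```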

Number Spell Out¶

In [69]:

from pythainlp.util import bahttext

bahttext(1234567890123.45)

Out[69]:

'หนึ่งล้านสองแสนสามหมื่นสี่พันห้าร้อยหกสิบเจ็ดล้านแปดแสนเก้าหมื่นหนึ่งร้อยยี่สิบสามบาทสี่สิบห้าสตางค์'

bahttext() rounds the satang part to the nearest whole satang.

In [70]:

bahttext(1.909)

Out[70]:

'หนึ่งบาทเก้าสิบเอ็ดสตางค์'
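
The rounding can be checked by hand: 1.909 baht rounds to 1.91, which is 1 baht and 91 satang. The sketch below does that split with the decimal module. The helper split_baht_satang is hypothetical, it only handles non-negative amounts, and the half-up rounding mode is an assumption, since the source does not state which mode bahttext() uses.

```python
from decimal import Decimal, ROUND_HALF_UP


def split_baht_satang(amount: float) -> tuple:
    """Round to two decimal places, then split into (baht, satang)."""
    # Assumed rounding mode: half up.
    rounded = Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    baht = int(rounded)
    satang = int((rounded - baht) * 100)
    return baht, satang


print(split_baht_satang(1.909))
print(split_baht_satang(1234567890123.45))
```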