Tokenization, or text segmentation, is the problem of dividing a string of written language into its component words.
The simplest way to divide a text into a list of its words is to split it on whitespace.
text = "Let's eat, grandpa"
print(text.split())
The problem with that approach is that contractions are not split (Let's -> Let + s) and punctuation marks stay attached to the nearest word (eat, -> eat + ,).
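To make the problem concrete, here is a small sketch that counts whitespace-split tokens. The sample sentence is made up for illustration; the point is only that the same word ends up under several different keys when punctuation stays attached.

```python
from collections import Counter

# Made-up sample sentence (an assumption, not taken from the tutorial's data)
sample = "Let's eat, grandpa. Grandpa, let's eat"

# Whitespace split after lowercasing
tokens = sample.lower().split()
counts = Counter(tokens)

# "eat" and "eat," are counted separately, as are "grandpa." and "grandpa,"
print(counts)
```

Any frequency count built on such tokens will underestimate how often a word really occurs.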
A more robust approach is to use a dedicated tokenizer. Most NLP libraries offer their own; here we will use tokenizers from the NLTK library.
NLTK offers many tokenizers. We'll work with the WordPunctTokenizer.
But first, let's install NLTK and download the necessary resources.
!pip install nltk
import nltk
nltk.download('popular')
We get a different result: punctuation marks are now handled as separate tokens.
from nltk.tokenize import WordPunctTokenizer
tokens = WordPunctTokenizer().tokenize("Let's eat your soup, Grandpa.")
print(tokens)
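The NLTK documentation describes WordPunctTokenizer as a regular-expression tokenizer built on the pattern \w+|[^\w\s]+, that is, runs of word characters or runs of non-word, non-space characters. The following is a rough standard-library sketch of that behavior, useful for seeing what happens without installing anything. The function name simple_word_punct_tokenize is hypothetical and not part of NLTK.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer:
# either a run of word characters, or a run of characters
# that are neither word characters nor whitespace.
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_word_punct_tokenize(text):
    """Hypothetical stand-in for WordPunctTokenizer().tokenize(text)."""
    return WORD_PUNCT_PATTERN.findall(text)

tokens = simple_word_punct_tokenize("Let's eat your soup, Grandpa.")
print(tokens)
```

Note that the apostrophe becomes its own token, so the contraction is split into three pieces rather than two.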
Let's tokenize the text from the Wikipedia Earth page and look at the frequency of the most common words.
from nltk.tokenize import WordPunctTokenizer
from collections import Counter
import requests
def wikipedia_page(title):
    '''
    This function returns the raw text of a wikipedia page
    given a wikipedia page title
    '''
    params = {
        'action': 'query',
        'format': 'json',  # request json formatted content
        'titles': title,  # title of the wikipedia page
        'prop': 'extracts',
        'explaintext': True
    }
    # send a request to the wikipedia api
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params=params
    ).json()
    # Parse the result
    page = next(iter(response['query']['pages'].values()))
    # return the page content
    if 'extract' in page:
        return page['extract']
    else:
        return "Page not found"
text = wikipedia_page('Earth').lower()
tokens = WordPunctTokenizer().tokenize(text)
print(Counter(tokens).most_common(20))
We now see that earth and earth's, for instance, are no longer counted as distinct tokens (earth's is split into earth, ' and s), and that punctuation marks are standalone tokens. This will come in handy if we want to remove them.
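Once punctuation marks are standalone tokens, removing them is a simple filter. The sketch below is one possible way to do it, not the only one. To stay runnable without NLTK or network access, it uses a regular expression as a stand-in for WordPunctTokenizer and a made-up sample sentence; both are assumptions for illustration.

```python
import re
from collections import Counter

# Made-up sample text (assumption); the tutorial uses the Wikipedia Earth page
sample = "earth's atmosphere protects earth. the earth, in turn, orbits the sun."

# Regex stand-in for WordPunctTokenizer (assumption, keeps the example offline)
tokens = re.findall(r"\w+|[^\w\s]+", sample)

# Keep only purely alphabetic tokens; this drops punctuation,
# but also drops numbers such as "2020"
words = [t for t in tokens if t.isalpha()]

print(Counter(words).most_common(3))
```

The leftover s from earth's survives this filter. Dropping single-character tokens, or using a stop word list, would be a natural next step.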
We can also tokenize on characters instead of words.
# example of character tokenization
char_tokens = list(text)
print("Most common characters in the text")
print(Counter(char_tokens).most_common(20))
print()
print(f"All characters in the text: \n{set(char_tokens)}")
Some words are better taken together: New York, happy end, Wall Street, linear regression, etc. When tokenizing, we may therefore want to consider all adjacent pairs of words (bigrams) in the text. We can do this with the NLTK ngrams function.
from nltk import ngrams
text = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?".lower()
# Tokenize
tokens = WordPunctTokenizer().tokenize(text)
# bigrams
bigrams = list(ngrams(tokens, n=2))
print(bigrams)
print()
bigrams = ['_'.join(bg) for bg in bigrams]
print(bigrams)
# and for trigrams
trigrams = ['_'.join(w) for w in ngrams(tokens,n=3)]
print(trigrams)
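To see what building n-grams amounts to, here is a minimal pure-Python sketch that slides a window of size n over the token list. The function name simple_ngrams is hypothetical; it is meant to illustrate the idea, not to replace the NLTK function.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples (illustrative helper)."""
    # tokens[i:] shifts the list by i positions; zipping the n shifted
    # copies together yields every window of n consecutive tokens.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "how much wood would a woodchuck chuck".split()

bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)

print(bigrams)
print(['_'.join(tg) for tg in trigrams])
```

A list of k tokens yields k - n + 1 n-grams, so the 7 tokens above give 6 bigrams and 5 trigrams.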
Let's add the bigrams and trigrams to the list of tokens from the Wikipedia Earth page and look at the frequency of the n-grams.
text = wikipedia_page('Earth').lower()
unigrams = WordPunctTokenizer().tokenize(text)
bigrams = ['_'.join(w) for w in ngrams(unigrams,n=2)]
trigrams = ['_'.join(w) for w in ngrams(unigrams,n=3)]
tokens = unigrams + bigrams + trigrams
print(f"we have a total of {len(tokens)} tokens, including: \n- {len(unigrams)} unigrams \n- {len(bigrams)} bigrams \n- {len(trigrams)} trigrams. ")
Counter(tokens).most_common(50)
Multiple bigrams appear among the top 50 tokens.
Adding n-grams to a list of tokens may help down the line when classifying text.
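To pick out the n-grams among the most frequent tokens, one option is to filter on the underscore used as a joiner. The sketch below uses a made-up sentence and builds bigrams with zip so that it runs on its own. It assumes the original tokens contain no underscores themselves, which is not guaranteed on real text since \w also matches the underscore.

```python
from collections import Counter

# Made-up sample (assumption); the tutorial uses the Wikipedia Earth page
unigrams = "the earth orbits the sun and the moon orbits the earth".split()

# Pair each token with its right-hand neighbor and join with "_"
bigrams = ['_'.join(pair) for pair in zip(unigrams, unigrams[1:])]
tokens = unigrams + bigrams

# Keep only the joined n-grams among the most common tokens
top = Counter(tokens).most_common(10)
top_bigrams = [(tok, count) for tok, count in top if '_' in tok]

print(top_bigrams)
```

On real text, many of the most frequent bigrams will involve punctuation or very common words, so filtering those out first usually gives a more informative list.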