Tokenization¶

Tokenization, or text segmentation, is the problem of dividing a string of written language into its component words.

The simplest way to divide a text into a list of its words is to split on whitespace.

In [ ]:

text = "Let's eat, grandpa"
print(text.split())

The problem with this approach is that contractions are not split (Let's stays a single token instead of Let + s) and punctuation marks stay attached to the nearest word (eat, stays a single token instead of eat + ,).
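To see what a punctuation-aware split looks like before reaching for a library, here is a minimal sketch using only Python's `re` module. The helper name `simple_tokenize` and the pattern are illustrative assumptions, not part of any library's API: the pattern keeps runs of word characters together and treats runs of punctuation as tokens of their own.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and runs of punctuation.

    Hypothetical helper, for illustration only.
    """
    # \w+      : one or more letters, digits or underscores
    # [^\w\s]+ : one or more characters that are neither word characters nor whitespace
    return re.findall(r"\w+|[^\w\s]+", text)

print(simple_tokenize("Let's eat, grandpa"))
# ['Let', "'", 's', 'eat', ',', 'grandpa']
```

The contraction is now split and the comma is a separate token, which is the behaviour the plain `split()` call lacks.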

A more robust way to tokenize is to use a dedicated tokenizer. Most NLP libraries offer their own tokenizers. Here we will use tokenizers from the NLTK library.

The NLTK library offers many tokenizers. We'll work with the WordPunctTokenizer.

But first let's install NLTK and download the necessary resources.

In [ ]:

!pip install nltk

In [ ]:

import nltk
nltk.download('popular')

Apply the WordPunctTokenizer¶

We get a different result: punctuation marks are now handled as separate tokens.

In [ ]:

from nltk.tokenize import WordPunctTokenizer
tokens = WordPunctTokenizer().tokenize("Let's eat your soup, Grandpa.")
print(tokens)

Let's tokenize the text from the Wikipedia Earth page and look at the frequency of the most common words.

In [ ]:

from nltk.tokenize import WordPunctTokenizer
from collections import Counter
import requests

def wikipedia_page(title):
    '''
    Return the plain-text extract of a Wikipedia page,
    given the page title.
    '''
    params = {
        'action': 'query',
        'format': 'json',      # request JSON-formatted content
        'titles': title,       # title of the Wikipedia page
        'prop': 'extracts',    # ask for the page extract
        'explaintext': True    # plain text instead of HTML
    }
    # send a request to the Wikipedia API
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params=params
    ).json()

    # the result is keyed by page id; take the first (and only) page
    page = next(iter(response['query']['pages'].values()))
    # return the page content if the page exists
    if 'extract' in page:
        return page['extract']
    else:
        return "Page not found"

In [ ]:

text = wikipedia_page('Earth').lower()
tokens = WordPunctTokenizer().tokenize(text)
print(Counter(tokens).most_common(20))

We now see that, for instance, earth and earth's are no longer counted as separate tokens (earth's is split into earth, ' and s), and that punctuation marks are standalone tokens. This will come in handy if we want to remove them.
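As a minimal sketch of that removal step, the snippet below drops every token made up only of punctuation characters. The token list is hand-written stand-in data (an assumption, not the actual Wikipedia output), and `string.punctuation` covers ASCII punctuation only, so typographic quotes or dashes would need extra handling.

```python
import string

# A small hand-written token list standing in for tokenizer output (assumed data).
tokens = ["earth", "'", "s", "surface", ",", "the", "earth", "."]

# Keep a token only if at least one of its characters is not ASCII punctuation.
punctuation = set(string.punctuation)
words = [t for t in tokens if not all(ch in punctuation for ch in t)]

print(words)
# ['earth', 's', 'surface', 'the', 'earth']
```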

Tokenization on characters¶

We can also tokenize on characters instead of words.

In [ ]:

# example of character tokenization
char_tokens = list(text)
print("Most common characters in the text")
print(Counter(char_tokens).most_common(20))
print()
print(f"All characters in the text: \n{set(char_tokens)}")

N-grams¶

Some words are better taken together: New York, happy end, Wall Street, linear regression, etc. When tokenizing, we may therefore want to consider all pairs of adjacent words in the text (bigrams) or, more generally, all runs of n adjacent tokens (n-grams). We can build them with the NLTK ngrams function.

In [ ]:

from nltk import ngrams

text = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?".lower()

# Tokenize
tokens = WordPunctTokenizer().tokenize(text)

# bigrams 
bigrams = list(ngrams(tokens, n=2))
print(bigrams)

print()
bigrams = ['_'.join(bg) for bg in bigrams]
print(bigrams)

In [ ]:

# and for trigrams

trigrams = ['_'.join(w) for w in ngrams(tokens, n=3)]

print(trigrams)
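To make the sliding-window idea behind n-grams concrete, here is a pure-Python sketch. It is not NLTK's implementation; `sliding_ngrams` is a hypothetical helper that zips together n shifted copies of the token list, so a list of k tokens yields k - n + 1 n-grams.

```python
def sliding_ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples (hypothetical helper)."""
    # zip over n shifted views of the list: tokens[0:], tokens[1:], ..., tokens[n-1:]
    # zip stops at the shortest view, so no incomplete n-gram is produced
    return list(zip(*(tokens[i:] for i in range(n))))

words = ["how", "much", "wood", "would", "a", "woodchuck", "chuck"]

# 7 tokens give 6 bigrams and 5 trigrams
print(sliding_ngrams(words, 2))
print(["_".join(g) for g in sliding_ngrams(words, 3)])
```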

Add n-grams to the list of tokens¶

Let's add the bigrams and trigrams to the list of tokens from the Wikipedia Earth page and look at the frequency of the n-grams.

In [ ]:

text = wikipedia_page('Earth').lower()
unigrams = WordPunctTokenizer().tokenize(text)
bigrams = ['_'.join(w) for w in ngrams(unigrams, n=2)]
trigrams = ['_'.join(w) for w in ngrams(unigrams, n=3)]

In [ ]:

tokens = unigrams + bigrams + trigrams

In [ ]:

print(f"we have a total of {len(tokens)} tokens, including: \n- {len(unigrams)} unigrams \n- {len(bigrams)} bigrams \n- {len(trigrams)} trigrams. ")

In [ ]:

Counter(tokens).most_common(50)

We have multiple bigrams in the top 50 tokens:

of_the
of_earth
in_the

Adding ngrams to a list of tokens may help down the line when classifying text.
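Because unigrams tend to dominate a combined frequency table, it can help to count each n-gram order on its own. The following is a minimal sketch under stated assumptions: the token list is a short hand-written stand-in for the tokenized Wikipedia page, and `joined_ngrams` is a hypothetical helper rather than an NLTK function.

```python
from collections import Counter

def joined_ngrams(tokens, n):
    """Join each run of n adjacent tokens with an underscore (hypothetical helper)."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy token list standing in for the tokenized Wikipedia page (assumed data).
unigrams = "the surface of the earth and the core of the earth".split()
bigrams = joined_ngrams(unigrams, 2)

# Counting the bigrams separately keeps them from being buried under unigrams.
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(3))
```

In this toy example `of_the` and `the_earth` each occur twice, mirroring the kind of frequent but not very informative bigrams seen in the top 50 tokens of the Earth page.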
