Torchtext Tutorial 01: Getting Started

0. Brief overview

In this tutorial we will...

create Field objects for preprocessing / tokenizing text
create Dataset objects with input text files and Field object(s) which return tokenized and processed text data
create Vocab objects that contain the vocabulary and word embeddings of the words from our data

In [ ]:

# Here's a brief implementation of what we'll be doing in this session

import torch
import spacy
from torchtext import data, datasets

# 1. create Field objects for preprocessing / tokenizing text
TEXT = data.Field(tokenize=data.get_tokenizer('spacy'), 
                  init_token='<SOS>', eos_token='<EOS>',lower=True)

# 2. create Dataset objects with input text files and Field object(s) which return tokenized and processed text data
train,val,test = datasets.WikiText2.splits(text_field=TEXT)

# 3. create Vocab objects that contain the vocabulary and word embeddings of the words from our data
TEXT.build_vocab(train, wv_type='glove.6B')
vocab = TEXT.vocab

1. Setting a Field object

Purpose of a Field object

Intuitively, a Field sets the rules for preprocessing the input text and builds a vocabulary from the words that appear in the data.

Things you can do with a Field object

1.1. apply a custom or external tokenizer (spaCy is recommended) to split strings into word tokens
1.2. automatically convert word tokens (strings) to indices (ints)
1.3. automatically add SOS (start-of-sentence) or EOS (end-of-sentence) tokens to input strings
1.4. convert text to lowercase
1.5. determine whether to pad sentences to a fixed length or leave them as variable lengths

Things you CAN'T do with a Field object

print batches of text data (remember, a Field defines how to tokenize/preprocess/label your data and arrange a vocabulary, and does not store the data itself)

MISC.

spaCy is an NLP package for Python that provides many features such as tokenizing, POS-tagging, and preprocessing.
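
To make the Field options listed above more concrete, here is a minimal plain-Python sketch of the same preprocessing steps (tokenize, lowercase, add SOS/EOS). The `preprocess` function is a hypothetical helper written only for illustration. It is not part of torchtext, and torchtext may apply these steps at different stages internally.

```python
# Plain-Python sketch of Field-style preprocessing (NOT torchtext code).
# `preprocess` and its parameters are hypothetical names used for illustration.
def preprocess(text, tokenize=str.split, lower=False,
               init_token=None, eos_token=None):
    """Tokenize a string, optionally lowercase it, and add boundary tokens."""
    tokens = tokenize(text)
    if lower:
        tokens = [tok.lower() for tok in tokens]
    if init_token is not None:
        tokens = [init_token] + tokens   # prepend start-of-sentence token
    if eos_token is not None:
        tokens = tokens + [eos_token]    # append end-of-sentence token
    return tokens

print(preprocess("This is a String", lower=True,
                 init_token="<SOS>", eos_token="<EOS>"))
# ['<SOS>', 'this', 'is', 'a', 'string', '<EOS>']
```

Passing a different callable as `tokenize` plays the same role as handing a spaCy tokenizer to a Field.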

In [ ]:

import torch
from torchtext import data, datasets

In [ ]:

# 1.0. setting up a default Field object
TEXT = data.Field()

In [ ]:

# 1.1. applying an external tokenizer

"""
The default tokenizer implemented in torchtext is merely
a .split() function, so we usually have to use our own versions.
Luckily, the most commonly used 'spacy' tokenizer can be
easily called.
Here we call a spacy tokenizer for English, and add it to our Field.
"""

tokenizer = data.get_tokenizer('spacy') # spacy tokenizer function
test_string = 'This is a string to be tokenized...'
print('original string: ',test_string)
print('tokenized string: ',list(tokenizer(test_string)))

TEXT = data.Field(tokenize=tokenizer) # add our tokenizer to our Field object

In [ ]:

"""
Or you can do this manually by calling the tokenizer from the spacy package.
FYI, spacy provides tokenizers for MANY languages.
"""

import spacy
spacy_en = spacy.load('en') # the default English package by Spacy

def tokenizer2(text): # create a tokenizer function
    return [tok.text for tok in spacy_en.tokenizer(text)]

test_string = 'This is a string to be tokenized...'
print('original string: ',test_string)
print('tokenized string: ',list(tokenizer2(test_string))) # use the tokenizer defined above

TEXT = data.Field(tokenize=tokenizer2)

In [ ]:

# 1.2. converting word tokens to indices

"""
use_vocab defaults to True. Setting it to False tells the Field
that the inputs are already numerical indices.
In most cases your inputs will be text rather than integer indices,
so you won't use this option often.
"""

TEXT = data.Field(use_vocab=False)

In [ ]:

# 1.3. Adding SOS and EOS tokens to input strings

"""
In seq2seq models, it is (almost) necessary to prepend an
SOS (start-of-sentence) token and append an EOS (end-of-sentence) token
so that the model can determine when to start and stop generating words.
These tokens are applied to any dataset created using this Field.
"""

TEXT = data.Field(init_token='<SOS>', eos_token='<EOS>')

In [ ]:

# 1.4. converting text to lowercase

"""
Converts all words to lower case
"""

TEXT = data.Field(lower=True)

In [ ]:

# 1.5. fixed vs variable lengths

"""
By default, RNNs in PyTorch can process sequences of any length,
so it is possible to train models on variable-length sequences.
However, you may have to fix the input length,
such as when using CNNs and some other models.

Note: Even with variable lengths, when working in batches
all sentences are padded to match the longest sentence in the batch.
"""

TEXT = data.Field(fix_length=40) # shorter strings will be padded
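
The difference between fixed and variable lengths in 1.5 can be sketched in plain Python. The `pad_batch` helper below is hypothetical (it is not a torchtext function), and the `<pad>` token name and the truncation of over-long sequences are assumptions made for this example.

```python
# Plain-Python sketch of batch padding (NOT torchtext code).
# `pad_batch` is a hypothetical helper; "<pad>" is an assumed padding token.
def pad_batch(batch, pad_token="<pad>", fix_length=None):
    """Pad token lists to fix_length, or to the longest list in the batch."""
    if fix_length is not None:
        target = fix_length
    else:
        target = max(len(tokens) for tokens in batch)
    padded = []
    for tokens in batch:
        tokens = tokens[:target]  # truncate sequences longer than the target
        padded.append(tokens + [pad_token] * (target - len(tokens)))
    return padded

batch = [["a", "b", "c"], ["d"]]
print(pad_batch(batch))                # variable length: pad to the longest (3)
print(pad_batch(batch, fix_length=2))  # fixed length: pad or truncate to 2
```

With no `fix_length`, the target length changes from batch to batch, which matches the note in the cell above.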

In [ ]:

# Actual Field object to be used for next step

TEXT = data.Field(tokenize=tokenizer, init_token='<SOS>', eos_token='<EOS>',
                 lower=True)
vars(TEXT)

2. Creating a Dataset object

Here we will create a dataset for language modelling. The input text data will be tokenized and preprocessed according to our Field settings.

Ingredients!

a Field object used to store the vocabulary of the text file
the path to a text file
an appropriate Dataset class

Things you can do with a Dataset object

2.1. print examples from the text

Introducing Datasets of various purposes

2.2. Language modelling (WikiText2)
2.3. Sentiment analysis (SST)

MISC.

Creating custom datasets will be covered in subsequent tutorials

In [ ]:

# create a LanguageModelingDataset instance using TEXT as Field

"""
dataset: Cornell Movie-Dialogs Corpus
"""

lang = datasets.LanguageModelingDataset(path='movie.txt',
                                       text_field=TEXT)

In [ ]:

# 2.1. print examples from text

"""
Dataset.examples holds the examples (= all of the text) of the dataset.
With the basic LanguageModelingDataset, the entire text corpus is
stored in a single example as one long list of word tokens.
"""

examples = lang.examples
print("Number of tokens: ", len(examples[0].text))
print("\n")
print("Print first 100 tokens: ",examples[0].text[:100])
print("\n")
print("Print last 10 tokens: ",examples[0].text[-10:])
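
To see why `examples[0].text` above is one long token list, here is a rough plain-Python sketch of how a language-modelling dataset could be assembled. The `Example` tuple, the `make_lm_examples` helper, and the `<eos>` marker added after every line are assumptions for illustration, not torchtext's actual implementation.

```python
from collections import namedtuple

# Hypothetical stand-in for a dataset example (NOT the torchtext class).
Example = namedtuple("Example", ["text"])

def make_lm_examples(lines, tokenize=str.split, eos="<eos>"):
    """Concatenate all lines into one token list held by a single example."""
    tokens = []
    for line in lines:
        tokens += tokenize(line)
        tokens.append(eos)  # assumed end-of-line marker
    return [Example(text=tokens)]

examples = make_lm_examples(["the cat sat", "on the mat"])
print(len(examples))     # 1
print(examples[0].text)  # ['the', 'cat', 'sat', '<eos>', 'on', 'the', 'mat', '<eos>']
```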

In [ ]:

# 2.2. Dataset for language modelling

"""
Torchtext can download the WikiText-2 dataset for language modelling.
The .splits() method of the WikiText2 class creates the
training / validation / test datasets.
Note that this class can ONLY read the WikiText2 data.
"""

# get Field
TEXT_wiki = data.Field(tokenize=tokenizer, init_token='<SOS>', eos_token='<EOS>',
                 lower=True)

# split into train, val, test
train, val, test = datasets.WikiText2.splits(text_field=TEXT_wiki)

In [ ]:

# 2.3. Dataset for sentiment analysis

"""
We will use the same .splits() method, this time provided by the SST class in torchtext.
The only difference is that we also need a label field to keep a vocabulary of
the labels from 'very negative' to 'very positive'
"""

# get Fields - sentiment analysis also requires a label field
TEXT_sst = data.Field(tokenize=tokenizer, init_token='<SOS>', eos_token='<EOS>',
                 lower=True)
LABEL_sst = data.Field(sequential=False)

# split into train, val, test
train, val, test = datasets.SST.splits(text_field=TEXT_sst, label_field=LABEL_sst)

In [ ]:

# 2.4. Dataset for natural language inference

# WIP

3. Using a Vocab object

Now you can build a vocabulary from the words in the text file and store it in your predefined Field object, TEXT. First call .build_vocab() on the Field with your dataset as input. You can then access the result through TEXT.vocab, which is a Vocab object also defined by torchtext. Here is a list of the features provided by Vocab.

Things you can do with a Vocab object

3.1. View vocabulary information (size, frequency of words)
3.2. View the created string2index dict (stoi) and index2string list (itos)
3.3. Create purpose-specific vocabularies (requires a Counter object)
3.4. Load external word embeddings
3.5. Easily handle unknown words
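
As a rough mental model of the features listed above, a vocabulary can be built from a Counter in a few lines of plain Python. The `build_vocab` function below is a hypothetical sketch, not the torchtext implementation. The special tokens and the tie-breaking order are assumptions.

```python
from collections import Counter

# Plain-Python sketch of a vocabulary (NOT torchtext code).
# `build_vocab` and its default special tokens are assumptions for illustration.
def build_vocab(counter, min_freq=1, max_size=None, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) built from a Counter of token frequencies."""
    # Most frequent words first; ties are broken alphabetically.
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [w for w, freq in ranked if freq >= min_freq and w not in specials]
    if max_size is not None:
        words = words[:max_size]  # here max_size does not count the special tokens
    itos = list(specials) + words               # index -> string (a list)
    stoi = {w: i for i, w in enumerate(itos)}   # string -> index (a dict)
    return itos, stoi

counter = Counter("the cat sat on the mat the end".split())
itos, stoi = build_vocab(counter, max_size=3)
print(itos)         # ['<unk>', '<pad>', 'the', 'cat', 'end']
print(stoi["the"])  # 2
```

Raising `min_freq` or lowering `max_size` shrinks the vocabulary, which is the idea behind the purpose-specific vocabularies in 3.3.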

In [ ]:

# 3.0. building a vocabulary

"""
We build a vocabulary of words included in the dataset.
Note that we can only build on the Field object that was used to create the dataset.
"""

TEXT.build_vocab(lang) # use dataset as input
vocab = TEXT.vocab

In [ ]:

# 3.1. vocabulary information (size, frequency of words, etc.)

"""
You can view information of the vocabulary.
Vocab.freqs returns a Counter object that stores the frequency of
all words that appeared in the text.
"""

print("Vocabulary size: ", len(vocab))
print("10 most frequent words: ", vocab.freqs.most_common(10))

In [ ]:

# 3.2. string2index (stoi), index2string (itos)

"""
The created Vocab object contains an index mapping for each word.
"""

print("First 10 words: ", vocab.itos[0:10])
print("First 10 words of text data: ", lang.examples[0].text[:10])
print("Index of the first word: ", vocab.stoi[lang.examples[0].text[0]])

In [ ]:

# 3.3. create purpose-specific vocabularies (requires a Counter object)

"""
The Vocab object created from a Field has many parameters set to their defaults.
We can create a new vocabulary from any Counter object
(e.g. the Counter object from our initial vocabulary).
Our new vocabulary may
1) contain only words that appear at least N times,
2) be no larger than a given maximum size
"""

counter = vocab.freqs # frequency of the original vocabulary created by Field
vocab2 = data.Vocab(counter=counter,min_freq=10) # discard words appearing less than 10 times
vocab3 = data.Vocab(counter=counter,max_size=100000) # set max number of words for a vocabulary

print(len(vocab))
print(len(vocab2))
print(len(vocab3))

In [ ]:

# 3.4. load external word embeddings

"""
We can load external word embeddings such as word2vec or glove with Vocab objects.
Here we build a Vocab object using 'glove.6B'
wv_dir = only if you have a custom word embedding file in .pt, .txt, .zip
wv_dim = 100, 200, 300 (provided by glove) # default=300
wv_type = 'glove.6B', 'glove.840B', 'glove.twitter.27B', 'glove.42B'
"""

########## NOTE: this external word embedding requires 800+ MB space ########## 
# 3.4.1. downloading embedding and loading into Field object 
GLOVE = data.Field()
lang2 = datasets.LanguageModelingDataset(path='movie.txt',
                                       text_field=GLOVE)
GLOVE.build_vocab(lang2, wv_type='glove.6B')

# 3.4.2. loading embedding into specific Vocab object
vocab2.load_vectors(wv_type='glove.6B', wv_dim=100)

In [ ]:

# 3.4.3. word embeddings in Vocab objects

"""
Word embedding vectors are accessible via Vocab.vectors.
They behave like any ordinary FloatTensor.
"""

print("Word embedding size: ", vocab2.vectors.size())
print(vocab2.vectors[0:10])
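
One way to think about the vectors table printed above: row i holds the embedding of the word at index i in itos. The plain-Python sketch below illustrates that alignment. The `build_vectors` helper and the toy 2-dimensional vectors are made up for illustration, and filling missing words with zeros is an assumption (the actual initialization may differ).

```python
# Plain-Python sketch of an embedding table aligned with itos (NOT torchtext code).
# `build_vectors` is a hypothetical helper; the vectors below are toy values.
def build_vectors(itos, pretrained, dim):
    """Return one vector per vocabulary entry, in itos order."""
    zero = [0.0] * dim  # assumed fallback for words without a pretrained vector
    return [list(pretrained.get(word, zero)) for word in itos]

itos = ["<unk>", "cat", "dog"]
pretrained = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
vectors = build_vectors(itos, pretrained, dim=2)

print(len(vectors), len(vectors[0]))  # vocabulary size, embedding size
print(vectors[itos.index("dog")])     # [0.3, 0.4]
```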

In [ ]:

# 3.5. easily handle unknown words

"""
With a plain dictionary you would need exception handling to deal with unknown words,
but a Vocab object automatically maps any unknown word to <unk>.
"""

unknown_word = "humbahumba"
print("Index for unknown word %s: %d" %(unknown_word, vocab2.stoi[unknown_word]))
print("Token for unknown word: ", vocab2.itos[vocab2.stoi[unknown_word]])
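
The contrast with a normal dictionary can be sketched in plain Python. This is only an illustration of the idea using `collections.defaultdict`, and it assumes that `<unk>` sits at index 0. It does not show how Vocab is implemented internally.

```python
from collections import defaultdict

# A plain dict raises KeyError for words it has never seen.
plain = {"<unk>": 0, "hello": 1}
try:
    plain["humbahumba"]
except KeyError:
    print("plain dict: KeyError")

# A defaultdict(int) returns 0 for missing keys, which is the assumed <unk> index.
stoi = defaultdict(int, plain)
itos = ["<unk>", "hello"]
print(itos[stoi["humbahumba"]])  # <unk>
```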