NLP With Transformers

HuggingFace Package

We will use the HuggingFace implementation of transformers. Since we have already installed PyTorch, we can now install the transformers package with

  pip install transformers

The package includes many pre-trained neural networks for a variety of natural language processing tasks.
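
As a quick check that the installation worked, you can import the package and print its version number; the exact version you see will depend on when you install.

import transformers
print(transformers.__version__)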

Sentiment Analysis

As a first example, let's detect the sentiment of a short text, using a pre-trained network.

In [4]:
import transformers as tr

sentiment = tr.pipeline('sentiment-analysis')
In [5]:
sentiment('CS440 is a great class!')
Out[5]:
[{'label': 'POSITIVE', 'score': 0.9998645186424255}]
In [6]:
sentiment('Completing my BS degree in computer science took a lot of hard work.')
Out[6]:
[{'label': 'NEGATIVE', 'score': 0.990814745426178}]
In [7]:
sentiment('But I am happy to have graduated.')
Out[7]:
[{'label': 'POSITIVE', 'score': 0.9998417496681213}]

Let's try sentences from the preface of Stuart Russell's book Human Compatible: Artificial Intelligence and the Problem of Control.

In [8]:
russell = '''
This book is about the past, present, and future of our attempt to understand and create intelligence.
This matters, not because AI is rapidly becoming a pervasive aspect of the present but because it is the dominant technology of the future.
The world's great powers are waking up to this fact, and the world's largest corporations have known it for some time.
We cannot predict exactly how the technology will develop or on what timeline.
Nevertheless, we must plan for the possibility that machines will far exceed the human capacity for decision making in the real world.
What then?
Everything civilization has to offer is the product of our intelligence; gaining access to considerably greater intelligence would be the biggest event in human history.
The purpose of the book is to explain why it might be the last event in human history and how to make sure that it is not.
'''
In [9]:
russell = [s for s in russell.split('\n') if len(s) > 0]
russell
Out[9]:
['This book is about the past, present, and future of our attempt to understand and create intelligence.',
 'This matters, not because AI is rapidly becoming a pervasive aspect of the present but because it is the dominant technology of the future.',
 "The world's great powers are waking up to this fact, and the world's largest corporations have known it for some time.",
 'We cannot predict exactly how the technology will develop or on what timeline.',
 'Nevertheless, we must plan for the possibility that machines will far exceed the human capacity for decision making in the real world.',
 'What then?',
 'Everything civilization has to offer is the product of our intelligence; gaining access to considerably greater intelligence would be the biggest event in human history.',
 'The purpose of the book is to explain why it might be the last event in human history and how to make sure that it is not.']
In [10]:
sentiment(russell)
Out[10]:
[{'label': 'POSITIVE', 'score': 0.9988692998886108},
 {'label': 'POSITIVE', 'score': 0.9920300245285034},
 {'label': 'POSITIVE', 'score': 0.9992543458938599},
 {'label': 'NEGATIVE', 'score': 0.996938943862915},
 {'label': 'NEGATIVE', 'score': 0.9982810020446777},
 {'label': 'NEGATIVE', 'score': 0.9920858144760132},
 {'label': 'POSITIVE', 'score': 0.9977427124977112},
 {'label': 'NEGATIVE', 'score': 0.9920915365219116}]
In [11]:
for words, sentiment_result in zip(russell, sentiment(russell)):
    print('\n', words)
    print('   ', sentiment_result['label'], sentiment_result['score'])
 This book is about the past, present, and future of our attempt to understand and create intelligence.
    POSITIVE 0.9988692998886108

 This matters, not because AI is rapidly becoming a pervasive aspect of the present but because it is the dominant technology of the future.
    POSITIVE 0.9920300245285034

 The world's great powers are waking up to this fact, and the world's largest corporations have known it for some time.
    POSITIVE 0.9992543458938599

 We cannot predict exactly how the technology will develop or on what timeline.
    NEGATIVE 0.996938943862915

 Nevertheless, we must plan for the possibility that machines will far exceed the human capacity for decision making in the real world.
    NEGATIVE 0.9982810020446777

 What then?
    NEGATIVE 0.9920858144760132

 Everything civilization has to offer is the product of our intelligence; gaining access to considerably greater intelligence would be the biggest event in human history.
    POSITIVE 0.9977427124977112

 The purpose of the book is to explain why it might be the last event in human history and how to make sure that it is not.
    NEGATIVE 0.9920915365219116

This model was trained on the SST-2 dataset, which contains 11,855 sentences from Rotten Tomatoes movie reviews.
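
The sentiment-analysis pipeline above quietly downloaded a default English checkpoint fine-tuned on SST-2. If you want to be explicit about which model is used (or pin it for reproducibility), you can name it yourself. A minimal sketch, assuming the default checkpoint at the time these notes were written, distilbert-base-uncased-finetuned-sst-2-english:

sentiment = tr.pipeline('sentiment-analysis',
                        model='distilbert-base-uncased-finetuned-sst-2-english')
sentiment('CS440 is a great class!')  # should again return a POSITIVE label with a high score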

The transformers package contains other pre-trained models, including the following one trained on multiple languages.

In [12]:
sentiment = tr.pipeline('sentiment-analysis', model='nlptown/bert-base-multilingual-uncased-sentiment')
In [13]:
sentiment(russell)
Out[13]:
[{'label': '5 stars', 'score': 0.4648197293281555},
 {'label': '3 stars', 'score': 0.42040956020355225},
 {'label': '5 stars', 'score': 0.6286720633506775},
 {'label': '3 stars', 'score': 0.3582031726837158},
 {'label': '3 stars', 'score': 0.47712787985801697},
 {'label': '1 star', 'score': 0.3162928521633148},
 {'label': '5 stars', 'score': 0.5440354943275452},
 {'label': '4 stars', 'score': 0.3630126714706421}]
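
Since this model was trained on reviews written in several languages, it can also be given non-English text directly. A sketch (outputs not shown here; we would expect star-rating labels like the ones above):

sentiment('Dieses Buch ist ausgezeichnet!')      # German: 'This book is excellent!'
sentiment('Ce cours était une perte de temps.')  # French: 'This course was a waste of time.'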
In [14]:
sentiment.model
Out[14]:
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (2): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (3): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (4): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (5): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (6): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (7): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (8): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (9): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (10): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (11): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=5, bias=True)
)

So, what is a BertModel? You can read the paper that introduced BERT here. The paper coins the acronym 'BERT' for Bidirectional Encoder Representations from Transformers and describes the model and how it is trained in detail.
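
The pipelines above bundle a tokenizer and a model together. If you want to work with BERT directly, the transformers package also exposes those pieces separately. A minimal sketch, assuming the 'bert-base-uncased' checkpoint; recent versions of the package return an output object with a last_hidden_state attribute, while older versions return a tuple:

tokenizer = tr.AutoTokenizer.from_pretrained('bert-base-uncased')
bert = tr.AutoModel.from_pretrained('bert-base-uncased')

inputs = tokenizer('CS440 is a great class!', return_tensors='pt')
outputs = bert(**inputs)
outputs.last_hidden_state.shape  # one 768-dimensional vector per input token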

Let's look at other NLP applications that are available in the HuggingFace transformers package.

In [15]:
m = sentiment.model
In [16]:
m.num_parameters()
Out[16]:
167360261
In [17]:
f'{m.num_parameters():,}'
Out[17]:
'167,360,261'
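
The same count can be computed directly from the model's parameter tensors, which makes it clear where the number comes from: it is simply the total number of weights and biases in all of the layers listed above.

sum(p.numel() for p in m.parameters())  # should match m.num_parameters()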

Summarization

In [18]:
summarize = tr.pipeline("summarization")
In [19]:
russell = '''
This book is about the past, present, and future of our attempt to understand and create intelligence.
This matters, not because AI is rapidly becoming a pervasive aspect of the present but because it is the dominant technology of the future.
The world's great powers are waking up to this fact, and the world's largest corporations have known it for some time.
We cannot predict exactly how the technology will develop or on what timeline.
Nevertheless, we must plan for the possibility that machines will far exceed the human capacity for decision making in the real world.
What then?
Everything civilization has to offer is the product of our intelligence; gaining access to considerably greater intelligence would be the biggest event in human history.
The purpose of the book is to explain why it might be the last event in human history and how to make sure that it is not.
'''
In [20]:
summarize(russell, max_length=130, min_length=30, do_sample=False)
Out[20]:
[{'summary_text': " This book is about the past, present, and future of our attempt to understand and create intelligence . The world's great powers are waking up to this fact, and the world's largest corporations have known it for some time ."}]
In [21]:
summarize(russell, max_length=160, min_length=50, do_sample=False)
Out[21]:
[{'summary_text': " This book is about the past, present, and future of our attempt to understand and create intelligence . The world's great powers are waking up to this fact, and the world's largest corporations have known it for some time . The purpose of the book is to explain why it might be the last event in human history and how to make sure it is not ."}]
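
Here max_length and min_length bound the length of the generated summary in tokens, and do_sample=False makes the generation deterministic. Setting do_sample=True instead samples tokens stochastically, so each call can return a different summary; a sketch:

summarize(russell, max_length=60, min_length=20, do_sample=True)  # output varies from run to run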

Translation

In [22]:
translate_en_to_de = tr.pipeline('translation_en_to_de')


Some weights of T5Model were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
translate_en_to_de('Hello, my name is Chuck.')
Out[23]:
[{'translation_text': 'Hallo, mein Name ist Chuck.'}]
In [24]:
translate_en_to_de(russell)
Out[24]:
[{'translation_text': 'Dieses Buch befasst sich mit der Vergangenheit, der Gegenwart und der Zukunft unseres Versuchs, Intelligenz zu verstehen und zu schaffen. Dies ist wichtig, nicht weil KI schnell zu einem allgegenwärtigen Aspekt der Gegenwart wird, sondern weil es die dominierende Technologie der Zukunft ist. Die Großmächte der Welt erwachen auf diese Tatsache, und die größten Konzerne der Welt wissen dies seit einiger Zeit. Wir können nicht genau vorhersagen,'}]
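
Notice that the German output stops in mid-sentence: the model only generates up to a default maximum number of tokens. Passing a larger max_length through the pipeline should let more of the paragraph be translated; a sketch (max_length here is a generation parameter, not something set in the original call):

translate_en_to_de(russell, max_length=400)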

Details of Transformers

Some of these notes are modified from Buomsoo Kim's blog, which I have found very helpful in understanding NLP history and current algorithms. This article by Samuel Lynn-Evans is also very helpful.

Take a look at manythings.org, the source of the tab-separated English-German sentence pairs we download below.

In [25]:
import io
import zipfile
import re
from tqdm import tqdm  # progress bar
import numpy as np
import torch
import matplotlib.pyplot as plt
In [26]:
!curl -O https://www.manythings.org/anki/deu-eng.zip

with zipfile.ZipFile('deu-eng.zip') as zf:
    with io.TextIOWrapper(zf.open('deu.txt'), encoding="utf-8") as f:
        sentences = f.readlines()
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8129k  100 8129k    0     0  10.6M      0 --:--:-- --:--:-- --:--:-- 10.6M
In [27]:
sentences[:10]
Out[27]:
['Go.\tGeh.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)\n',
 'Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\n',
 'Hi.\tGrüß Gott!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #659813 (Esperantostern)\n',
 'Run!\tLauf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #941078 (Fingerhut)\n',
 'Run.\tLauf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #4008918 (JSakuragi) & #941078 (Fingerhut)\n',
 'Wow!\tPotzdonner!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122382 (Pfirsichbaeumchen)\n',
 'Wow!\tDonnerwetter!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122391 (Pfirsichbaeumchen)\n',
 'Fire!\tFeuer!\tCC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #1958697 (Tamy)\n',
 'Help!\tHilfe!\tCC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #575889 (MUIRIEL)\n',
 'Help!\tZu Hülf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #2122375 (Pfirsichbaeumchen)\n']
In [28]:
MAX_N_SENTENCES = 10000
MAX_SENTENCE_LENGTH = 10
eng_sentences, deu_sentences = [], []
eng_words, deu_words = set(), set()

for i in tqdm(range(MAX_N_SENTENCES)):
    sentence_i = np.random.randint(len(sentences))
    # keep only word characters (letters and digits) in each sentence
    eng_sent, deu_sent = ["<sos>"], ["<sos>"]  # start of sentence tag
    eng_sent += re.findall(r"\w+", sentences[sentence_i].split("\t")[0])
    deu_sent += re.findall(r"\w+", sentences[sentence_i].split("\t")[1])

    # change to lowercase
    eng_sent = [x.lower() for x in eng_sent]
    deu_sent = [x.lower() for x in deu_sent]
    eng_sent.append('<eos>')  # end of sentence tag
    deu_sent.append('<eos>')

    # Truncate sentences longer than MAX_SENTENCE_LENGTH and
    # add <pad> to the end of shorter ones.
    if len(eng_sent) >= MAX_SENTENCE_LENGTH:
        eng_sent = eng_sent[:MAX_SENTENCE_LENGTH]
    else:
        eng_sent.extend(['<pad>'] * (MAX_SENTENCE_LENGTH - len(eng_sent)))

    if len(deu_sent) >= MAX_SENTENCE_LENGTH:
        deu_sent = deu_sent[:MAX_SENTENCE_LENGTH]
    else:
        deu_sent.extend(['<pad>'] * (MAX_SENTENCE_LENGTH - len(deu_sent)))

    # add parsed sentences
    eng_sentences.append(eng_sent)
    deu_sentences.append(deu_sent)

    # update unique words
    eng_words.update(eng_sent)
    deu_words.update(deu_sent)

eng_words, deu_words = list(eng_words), list(deu_words)
100%|██████████| 10000/10000 [00:00<00:00, 62761.45it/s]
In [29]:
len(eng_words), len(deu_words)
Out[29]:
(4546, 6913)
In [30]:
for i in range(10):
    print()
    print(eng_sentences[i])
    print(deu_sentences[i])
['<sos>', 'tom', 'can', 't', 'believe', 'mary', 'let', 'herself', 'get', 'caught']
['<sos>', 'tom', 'kann', 'nicht', 'glauben', 'dass', 'maria', 'sich', 'erwischen', 'ließ']

['<sos>', 'even', 'though', 'tom', 'just', 'had', 'his', 'fortieth', 'birthday', 'i']
['<sos>', 'tom', 'hatte', 'zwar', 'gerade', 'seinen', 'vierzigsten', 'geburtstag', 'ich', 'glaube']

['<sos>', 'she', 'is', '35', 'years', 'old', 'and', 'in', 'the', 'prime']
['<sos>', 'sie', 'ist', '35', 'und', 'in', 'ihren', 'besten', 'jahren', '<eos>']

['<sos>', 'guess', 'who', 'i', 'am', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>']
['<sos>', 'rate', 'wer', 'ich', 'bin', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>']

['<sos>', 'even', 'tom', 'doesn', 't', 'know', 'mary', '<eos>', '<pad>', '<pad>']
['<sos>', 'selbst', 'tom', 'kennt', 'maria', 'nicht', '<eos>', '<pad>', '<pad>', '<pad>']

['<sos>', 'tom', 'is', 'drinking', 'a', 'beer', '<eos>', '<pad>', '<pad>', '<pad>']
['<sos>', 'tom', 'trinkt', 'ein', 'bier', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>']

['<sos>', 'i', 'work', 'as', 'many', 'hours', 'as', 'you', 'do', '<eos>']
['<sos>', 'ich', 'arbeite', 'gleich', 'viele', 'stunden', 'wie', 'du', '<eos>', '<pad>']

['<sos>', 'the', 'students', 'couldn', 't', 'answer', '<eos>', '<pad>', '<pad>', '<pad>']
['<sos>', 'antworten', 'konnten', 'die', 'studenten', 'nicht', '<eos>', '<pad>', '<pad>', '<pad>']

['<sos>', 'i', 'want', 'a', 'new', 'kitchen', '<eos>', '<pad>', '<pad>', '<pad>']
['<sos>', 'ich', 'will', 'eine', 'neue', 'küche', '<eos>', '<pad>', '<pad>', '<pad>']

['<sos>', 'i', 'll', 'dream', 'about', 'you', '<eos>', '<pad>', '<pad>', '<pad>']
['<sos>', 'ich', 'werde', 'von', 'dir', 'träumen', '<eos>', '<pad>', '<pad>', '<pad>']

Convert each word into an integer index.

In [31]:
for i in tqdm(range(len(eng_sentences))):
  eng_sentences[i] = [eng_words.index(x) for x in eng_sentences[i]]
  deu_sentences[i] = [deu_words.index(x) for x in deu_sentences[i]]
100%|██████████| 10000/10000 [00:09<00:00, 1024.66it/s]
In [32]:
i = 10
print(eng_sentences[i])
print([eng_words[x] for x in eng_sentences[i]])
print(deu_sentences[i])
print([deu_words[x] for x in deu_sentences[i]])  
[3201, 4155, 4116, 1527, 2592, 1849, 1427, 1119, 3499, 1480]
['<sos>', 'tom', 'paid', 'a', 'lot', 'of', 'money', 'for', 'that', 'guitar']
[6383, 3308, 4227, 3056, 2729, 4397, 2640, 1213, 303, 2962]
['<sos>', 'tom', 'hat', 'einen', 'haufen', 'geld', 'für', 'diese', 'gitarre', 'bezahlt']
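
The conversion above calls eng_words.index(x) for every word, which scans the whole vocabulary list each time. A dictionary makes the same lookup constant-time; a sketch of that alternative, which would replace the loop above while the sentences still contain words rather than integers:

eng_index = {word: i for i, word in enumerate(eng_words)}
deu_index = {word: i for i, word in enumerate(deu_words)}

for i in range(len(eng_sentences)):
    eng_sentences[i] = [eng_index[word] for word in eng_sentences[i]]
    deu_sentences[i] = [deu_index[word] for word in deu_sentences[i]]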
In [33]:
class TransformerNet(torch.nn.Module):

    def __init__(self, X_vocab_size, T_vocab_size, embedding_dim, n_hiddens, n_head, n_layers, dropout):
        super().__init__()

        # Learned word embeddings for the source (encoder) and target (decoder) vocabularies.
        self.enc_embedding = torch.nn.Embedding(X_vocab_size, embedding_dim)
        self.dec_embedding = torch.nn.Embedding(T_vocab_size, embedding_dim)
        # Encoder-decoder transformer with n_layers layers on each side.
        self.transformer = torch.nn.Transformer(d_model=embedding_dim, nhead=n_head,
                                                num_encoder_layers=n_layers, num_decoder_layers=n_layers,
                                                dim_feedforward=n_hiddens, dropout=dropout)
        # Map each transformer output vector to log-probabilities over the target vocabulary.
        self.dense = torch.nn.Linear(embedding_dim, T_vocab_size)
        self.log_softmax = torch.nn.LogSoftmax(dim=2)

    def forward(self, X, T):
        src = self.enc_embedding(X)    # integer word indices -> embedding vectors
        tgt = self.dec_embedding(T)
        Y = self.transformer(src, tgt)
        return self.log_softmax(self.dense(Y))
In [34]:
ENG_VOCAB_SIZE = len(eng_words)
DEU_VOCAB_SIZE = len(deu_words)
HIDDEN_SIZE = 16
EMBEDDING_DIM = 30
NUM_HEADS = 2
NUM_LAYERS = 3
DROPOUT = True
DEVICE = 'cpu'  # torch.device('cuda') 

NUM_EPOCHS = 200
LEARNING_RATE = 1e-2
BATCH_SIZE = 128
In [35]:
n = 500
X = torch.tensor(eng_sentences[:n])
T = torch.tensor(deu_sentences[:n])
In [36]:
model = TransformerNet(ENG_VOCAB_SIZE, DEU_VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_SIZE, NUM_HEADS, NUM_LAYERS, DROPOUT).to(DEVICE)

nll_f = torch.nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr = LEARNING_RATE)
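
NLLLoss expects log-probabilities, which is why TransformerNet ends with a LogSoftmax layer. An equivalent formulation, shown here only as a sketch of the relationship and not as a change to the code above, drops the LogSoftmax and applies CrossEntropyLoss to the raw outputs of the final Linear layer, since cross-entropy combines the log-softmax and negative log-likelihood steps:

ce_f = torch.nn.CrossEntropyLoss()
# With raw scores Z of shape (n_sentences, vocab_size, sentence_length) and integer
# targets T of shape (n_sentences, sentence_length):
#     ce_f(Z, T)  is the same as  nll_f(torch.nn.functional.log_softmax(Z, dim=1), T)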
In [37]:
model
Out[37]:
TransformerNet(
  (enc_embedding): Embedding(4546, 30)
  (dec_embedding): Embedding(6913, 30)
  (transformer): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=30, out_features=30, bias=True)
          )
          (linear1): Linear(in_features=30, out_features=16, bias=True)
          (dropout): Dropout(p=True, inplace=False)
          (linear2): Linear(in_features=16, out_features=30, bias=True)
          (norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=True, inplace=False)
          (dropout2): Dropout(p=True, inplace=False)
        )
        (1): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=30, out_features=30, bias=True)
          )
          (linear1): Linear(in_features=30, out_features=16, bias=True)
          (dropout): Dropout(p=True, inplace=False)
          (linear2): Linear(in_features=16, out_features=30, bias=True)
          (norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=True, inplace=False)
          (dropout2): Dropout(p=True, inplace=False)
        )
        (2): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=30, out_features=30, bias=True)
          )
          (linear1): Linear(in_features=30, out_features=16, bias=True)
          (dropout): Dropout(p=True, inplace=False)
          (linear2): Linear(in_features=16, out_features=30, bias=True)
          (norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=True, inplace=False)
          (dropout2): Dropout(p=True, inplace=False)
        )
      )
      (norm): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=30, out_features=30, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): Linear(in_features=30, out_features=30, bias=True)
          )
          (linear1): Linear(in_features=30, out_features=16, bias=True)
          (dropout): Dropout(p=True, inplace=False)
          (linear2): Linear(in_features=16, out_features=30, bias=True)
          (norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=True, inplace=False)
          (dropout2): Dropout(p=True, inplace=False)
          (dropout3): Dropout(p=True, inplace=False)
        )
        (1): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=30, out_features=30, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): Linear(in_features=30, out_features=30, bias=True)
          )
          (linear1): Linear(in_features=30, out_features=16, bias=True)
          (dropout): Dropout(p=True, inplace=False)
          (linear2): Linear(in_features=16, out_features=30, bias=True)
          (norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=True, inplace=False)
          (dropout2): Dropout(p=True, inplace=False)
          (dropout3): Dropout(p=True, inplace=False)
        )
        (2): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=30, out_features=30, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): Linear(in_features=30, out_features=30, bias=True)
          )
          (linear1): Linear(in_features=30, out_features=16, bias=True)
          (dropout): Dropout(p=True, inplace=False)
          (linear2): Linear(in_features=16, out_features=30, bias=True)
          (norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=True, inplace=False)
          (dropout2): Dropout(p=True, inplace=False)
          (dropout3): Dropout(p=True, inplace=False)
        )
      )
      (norm): LayerNorm((30,), eps=1e-05, elementwise_affine=True)
    )
  )
  (dense): Linear(in_features=30, out_features=6913, bias=True)
  (log_softmax): LogSoftmax()
)
In [38]:
list(model.children())[0]
Out[38]:
Embedding(4546, 30)
In [39]:
len(eng_words)
Out[39]:
4546
In [40]:
list(model.children())[1]
Out[40]:
Embedding(6913, 30)
In [41]:
len(deu_words)
Out[41]:
6913
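
Before training, a quick shape check (not in the original notebook, just a sanity-check sketch) shows what the model returns: one vector of log-probabilities over the German vocabulary for each of the 10 positions in each sentence passed in.

with torch.no_grad():
    Y = model(X[:2, :], T[:2, :])
print(Y.shape)  # torch.Size([2, 10, 6913])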
In [42]:
n_samples = X.shape[0]
likelihood_trace = []
for epoch in tqdm(range(NUM_EPOCHS)):

    # Must train in batches to avoid exceeding memory capacity and crashing python.
    loss = 0
    for first_sample in range(0, n_samples, BATCH_SIZE):

        last_sample = min(n_samples - 1, first_sample + BATCH_SIZE)
        use_rows = slice(first_sample, last_sample)
        Y = model(X[use_rows, :], T[use_rows, :])
        # NLLLoss expects (n_samples, n_classes, sequence_length), so move the
        # vocabulary dimension of Y from last position to the middle.
        nll = nll_f(Y.permute(0, 2, 1), T[use_rows, :])
        optimizer.zero_grad()
        nll.backward()
        optimizer.step()
        loss += nll
    # Convert the accumulated negative log-likelihood back to a likelihood for plotting.
    likelihood_trace.append((-loss).exp())
100%|██████████| 200/200 [01:15<00:00,  2.66it/s]
In [43]:
plt.plot(range(1, NUM_EPOCHS + 1), likelihood_trace)
plt.xlabel('Epoch')
plt.ylabel('Likelihood')
Out[43]:
Text(0, 0.5, 'Likelihood')
In [44]:
def encoding_to_words(sentence, vocab):
    return ' '.join(filter(lambda x: x != '<sos>' and x != '<pad>' and x != '<eos>', 
                           [vocab[x] for x in sentence]))
In [45]:
n = 10
with torch.no_grad():
    Y = model(X[:n, :], T[:n, :])
    Y_sentences = Y.argmax(-1)
In [46]:
Y_sentences[:5]
Out[46]:
tensor([[6383, 3308, 2704, 4119, 3575, 5705,  349, 1431, 2751, 2738],
        [6383, 3308, 6167,  869, 3138, 4353,  795, 3541, 5910, 4721],
        [6383, 1809, 6405, 2358, 5847, 4792, 4020, 6196, 2702, 4842],
        [6383, 5213, 3945, 5910, 1813, 4842, 4525, 4525, 4525, 4525],
        [6383, 4867, 3308, 4013,  349, 4119, 4842, 4525, 4525, 4525]])
In [47]:
for i in range(n):
    print()
    print('    Input:', encoding_to_words(X[i], eng_words))
    print('Predicted:', encoding_to_words(Y_sentences[i], deu_words))
    print('   Target:', encoding_to_words(T[i], deu_words))
    Input: tom can t believe mary let herself get caught
Predicted: tom kann nicht glauben dass maria sich erwischen ließ
   Target: tom kann nicht glauben dass maria sich erwischen ließ

    Input: even though tom just had his fortieth birthday i
Predicted: tom hatte zwar gerade seinen vierzigsten geburtstag ich glaube
   Target: tom hatte zwar gerade seinen vierzigsten geburtstag ich glaube

    Input: she is 35 years old and in the prime
Predicted: sie ist 35 und in ihren besten jahren
   Target: sie ist 35 und in ihren besten jahren

    Input: guess who i am
Predicted: rate wer ich bin
   Target: rate wer ich bin

    Input: even tom doesn t know mary
Predicted: selbst tom kennt maria nicht
   Target: selbst tom kennt maria nicht

    Input: tom is drinking a beer
Predicted: tom trinkt ein bier
   Target: tom trinkt ein bier

    Input: i work as many hours as you do
Predicted: ich arbeite gleich viele stunden wie du
   Target: ich arbeite gleich viele stunden wie du

    Input: the students couldn t answer
Predicted: antworten konnten die studenten nicht
   Target: antworten konnten die studenten nicht

    Input: i want a new kitchen
Predicted: ich will eine neue küche
   Target: ich will eine neue küche

    Input: i ll dream about you
Predicted: ich werde von dir träumen
   Target: ich werde von dir träumen

Let's try some sentences that were not part of the training data.

In [48]:
n = 10
Xtest = torch.tensor(eng_sentences[-n:])
Ttest = torch.tensor(deu_sentences[-n:])

with torch.no_grad():
    Ytest = model(Xtest[:n, :], Ttest[:n, :])
    Ytest_sentences = Ytest.argmax(-1)
In [49]:
for i in range(n):
    print()
    print('    Input:', encoding_to_words(Xtest[i], eng_words))
    print('Predicted:', encoding_to_words(Ytest_sentences[i], deu_words))
    print('   Target:', encoding_to_words(Ttest[i], deu_words))
    Input: she accused him of stealing her money
Predicted: sie scheine ihn ihr geld möglicherweise zu haben
   Target: sie beschuldigte ihn ihr geld gestohlen zu haben

    Input: tom has a foreign car
Predicted: tom hat ein nur auto
   Target: tom hat ein ausländisches auto

    Input: we have more in common than i thought
Predicted: wir haben mehr wer als ich dachte
   Target: wir haben mehr gemeinsamkeiten als ich dachte

    Input: please answer this question for me
Predicted: bitte sie diese frage für mich
   Target: bitte beantworten sie diese frage für mich

    Input: my sister takes a shower every morning
Predicted: meine schwester betrunken jeden morgen
   Target: meine schwester duscht jeden morgen

    Input: tom is healthy
Predicted: tom ist frau
   Target: tom ist gesund

    Input: i don t care about your past
Predicted: ihre kein mich nicht
   Target: ihre vergangenheit interessiert mich nicht

    Input: tom tried to discourage mary from going out with
Predicted: tom finde maria davon wecke mit wie mathematische
   Target: tom versuchte maria davon abzubringen mit johannes auszugehen

    Input: we both know it s too late
Predicted: wir wissen beide dass es zu hemd ist
   Target: wir wissen beide dass es zu spät ist

    Input: where s your money
Predicted: wo ist ihr geld
   Target: wo ist ihr geld