Transformer Sequence-to-Sequence Model

Here we showcase a vanilla Transformer from the paper "Attention Is All You Need" (Vaswani et al., 2017), built with both encoder and decoder layers and trained on an English-to-French translation dataset.

In [0]:
!pip install pandas==0.24.0 -q
In [0]:
# Colab does not capture local Python modules, so we manually copy the code to the GPU instance.

!wget -q
!unzip -qq
!mkdir transformer
!mv pytorch-nlp-notebooks-develop/transformer/* transformer/
!rm -r pytorch-nlp-notebooks-develop
In [0]:
import numpy as np
import copy
import time
from pathlib import Path

from google_drive_downloader import GoogleDriveDownloader as gdd
from tqdm import tqdm_notebook, tqdm

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import Dataset, DataLoader, Subset
from torch.utils.data import random_split

# Check out the model architecture in transformer folder
from transformer.model import Transformer
from transformer.batch import *


import os

# Show better CUDA error messages by making kernel launches synchronous
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

In order to perform deep learning on a GPU (so that everything runs much faster), CUDA has to be installed and configured. Fortunately, Google Colab already has this set up, but if you want to try this on your own GPU, you can install CUDA from NVIDIA's website. Make sure you also install cuDNN for optimized performance.

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Download the data

We will download a dataset of English-to-French translations from a public Google Drive folder.

In [5]:
DATA_PATH = 'data/english_to_french.txt'
if not Path(DATA_PATH).is_file():
    gdd.download_file_from_google_drive(
        file_id='1Jf7QoW2NK6_ayEXZji6DAXDSIRMvapm3',
        dest_path=DATA_PATH,
    )
Downloading 1Jf7QoW2NK6_ayEXZji6DAXDSIRMvapm3 into data/english_to_french.txt... Done.
In [6]:
dataset = EnglishFrenchTranslations(DATA_PATH, max_vocab=1000, max_seq_len=100)
In [0]:
# Get the indices of the special tokens
SRC_VOCAB = dataset.token2idx_inputs
TRG_VOCAB = dataset.token2idx_targets
src_pad = torch.tensor(SRC_VOCAB[dataset.padding_token]).to(device)
src_sos = torch.tensor(SRC_VOCAB[dataset.start_of_sequence_token]).to(device)
src_eos = torch.tensor(SRC_VOCAB[dataset.end_of_sequence_token]).to(device)
trg_pad = torch.tensor(TRG_VOCAB[dataset.padding_token]).to(device)
trg_sos = torch.tensor(TRG_VOCAB[dataset.start_of_sequence_token]).to(device)
trg_eos = torch.tensor(TRG_VOCAB[dataset.end_of_sequence_token]).to(device)

Split into training and test set

In [0]:
train_size = int(0.9999 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

Batching - Create data generators using DataLoader

In [0]:
batch_size = 256
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=lambda batch: collate(batch, src_pad, trg_pad, device),
)

test_loader = DataLoader(
    test_dataset,
    batch_size=1,  # translate one sentence at a time
    collate_fn=lambda batch: collate(batch, src_pad, trg_pad, device),
)
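The `collate` function imported from `transformer/batch.py` pads every sequence in a batch to the length of the longest one. As a reference, such a pad-to-longest collate can be sketched as follows (the tuple layout, the `lengths` output, and the pad handling are assumptions, not the notebook's actual implementation):

```python
import torch

def pad_collate_sketch(batch, src_pad, trg_pad):
    """Pad variable-length (source, target) index sequences to the
    longest example in the batch. A sketch, not the notebook's collate."""
    sources, targets = zip(*batch)
    lengths = torch.tensor([len(s) for s in sources])
    src_max = max(len(s) for s in sources)
    trg_max = max(len(t) for t in targets)
    # Pre-fill with the padding index, then copy each sequence in
    src_batch = torch.full((len(batch), src_max), src_pad, dtype=torch.long)
    trg_batch = torch.full((len(batch), trg_max), trg_pad, dtype=torch.long)
    for i, (s, t) in enumerate(batch):
        src_batch[i, :len(s)] = torch.tensor(s, dtype=torch.long)
        trg_batch[i, :len(t)] = torch.tensor(t, dtype=torch.long)
    return src_batch, trg_batch, lengths
```

Padding to the batch maximum (rather than a global maximum) keeps the tensors as small as each batch allows.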

Define the Transformer model

In [10]:
from IPython.display import HTML, display
display(HTML("<table><tr><td><img src='images/scaled_dot_product_attention.png'></td><td><img src='images/multi_head_attention.png'></td></tr></table>"))
display(HTML("<center><img src='images/transformer.png'><center>"))
In [0]:
src_vocab_size = len(dataset.token2idx_inputs)
trg_vocab_size = len(dataset.token2idx_targets)
heads = 4
N = 1
d_model = 32 * heads
dropout = 0.1

model = Transformer(src_vocab_size, trg_vocab_size, d_model, N, heads, dropout).to(device)
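`d_model` is defined as a multiple of `heads` because multi-head attention splits the model dimension evenly across heads. Since the Transformer has no recurrence, position information is injected by adding sinusoidal positional encodings to the embeddings; a sketch of the paper's formula (not the code in `transformer/model.py`):

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). Assumes even d_model."""
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position / div_term)  # odd dimensions
    return pe
```

Each position gets a unique pattern of wavelengths, so relative offsets are expressible as linear functions of the encodings.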
In [12]:
param_sizes = [list(p.size()) for p in model.parameters() if p.requires_grad]
total_params = np.sum([np.prod(size) for size in param_sizes])
print('Number of sub-layers:', len(param_sizes))
print('Total number of trainable parameters:', total_params)
Number of sub-layers: 50
Total number of trainable parameters: 1637864

Define loss function and optimizer

In [0]:
criterion = nn.CrossEntropyLoss(ignore_index=int(trg_pad))

def loss_function(pred, real):
    # Use a mask so that only non-padding positions count towards the loss
    # .ge(x) --> binary-valued matrix where value >= x
    mask = real.ge(1).type(torch.float)
    loss_ = criterion(pred, real) * mask
    return torch.mean(loss_)

optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=0.001)


In [0]:
def train(train_loader, model, epochs, optimizer):
    start = time.time()
    for epoch in range(epochs):
        total_loss = total = 0
        progress_bar = tqdm_notebook(train_loader, desc='Training', leave=False)
        for inputs, targets, lengths in progress_bar:
            # Clear old gradients
            optimizer.zero_grad()
            # Create source & target sequence masks
            src_mask, trg_mask = create_masks(
                inputs, targets[:, :-1], src_pad, trg_pad)
            # Forward pass, output: (batch_size, seq_len, target_vocab)
            output = model(inputs, targets[:, :-1], src_mask, trg_mask)
            # Offset targets by 1 position so the model predicts the next token
            # pred: (N, C) | y: (N,)
            # Pass the raw linear outputs (logits) to the loss instead of
            # applying softmax; CrossEntropyLoss applies log-softmax internally
            pred = output.view(-1, output.size(-1))
            y = targets[:, 1:].contiguous().view(-1)
            # Compute loss
            loss = loss_function(pred, y)
            # Backward pass
            loss.backward()
            # Take a step in the right direction
            optimizer.step()

            # Record metrics
            total_loss += loss.item()
            total += targets.size(0)

        train_loss = total_loss / total

        tqdm.write(f'epoch #{epoch + 1:3d}\ttrain_loss: {train_loss:.2e}\n')
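`create_masks` is imported from `transformer/batch.py`. Typically the source mask hides padding positions, and the target mask additionally hides future positions with a lower-triangular ("subsequent") mask so the decoder cannot peek ahead; a sketch under those assumptions (exact shapes are guesses, not the notebook's implementation):

```python
import torch

def make_masks_sketch(src, trg, src_pad, trg_pad):
    """src, trg: (batch, seq_len) index tensors.
    Returns boolean masks where True = attend, False = blocked."""
    # Source mask: block attention to padding positions
    src_mask = (src != src_pad).unsqueeze(-2)             # (batch, 1, src_len)
    # Target mask: padding mask AND lower-triangular causal mask
    trg_len = trg.size(1)
    pad_mask = (trg != trg_pad).unsqueeze(-2)             # (batch, 1, trg_len)
    causal = torch.tril(torch.ones(trg_len, trg_len, dtype=torch.bool))
    trg_mask = pad_mask & causal                          # (batch, trg_len, trg_len)
    return src_mask, trg_mask
```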
In [15]:
epochs = 20
train(train_loader, model, epochs, optimizer)
epoch #  1	train_loss: 1.18e-02

epoch #  2	train_loss: 6.53e-03

epoch #  3	train_loss: 5.82e-03

epoch #  4	train_loss: 5.45e-03

epoch #  5	train_loss: 5.22e-03

epoch #  6	train_loss: 5.07e-03

epoch #  7	train_loss: 4.96e-03

epoch #  8	train_loss: 4.96e-03

epoch #  9	train_loss: 4.93e-03

epoch # 10	train_loss: 4.91e-03

epoch # 11	train_loss: 4.96e-03

epoch # 12	train_loss: 4.87e-03

epoch # 13	train_loss: 4.88e-03

epoch # 14	train_loss: 5.01e-03

epoch # 15	train_loss: 5.09e-03

epoch # 16	train_loss: 4.98e-03

epoch # 17	train_loss: 5.03e-03

epoch # 18	train_loss: 5.08e-03

epoch # 19	train_loss: 5.13e-03

epoch # 20	train_loss: 5.13e-03

Let's translate with some test data

At prediction time, the model outputs a probability distribution over the target vocabulary one position at a time. At each step, beam search keeps only the top-k sequences with the highest accumulated log-likelihood.
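The actual beam search lives in `model.predict` in `transformer/model.py`; the top-k bookkeeping it performs can be sketched as follows (the `step_fn` interface, token ids, and termination rule are assumptions for illustration):

```python
import torch

def beam_search_sketch(step_fn, sos, eos, max_len, beam_size):
    """step_fn(seq) -> 1-D tensor of log-probs over the vocab for the next token.
    Keeps the beam_size prefixes with the highest accumulated log-likelihood."""
    beams = [([sos], 0.0)]  # (token sequence, accumulated log-likelihood)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                candidates.append((seq, score))  # finished beam carries over
                continue
            log_probs = step_fn(seq)
            top_lp, top_idx = log_probs.topk(beam_size)
            for lp, idx in zip(top_lp.tolist(), top_idx.tolist()):
                candidates.append((seq + [idx], score + lp))
        # Keep only the top-k sequences by accumulated log-likelihood
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]
```

Summing log-probabilities instead of multiplying probabilities avoids numerical underflow on long sequences.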

In [0]:
def translate(test_loader, src_vocab, trg_vocab, sos, pad, eos, max_seq_len, beam_size):
    model.eval()
    with torch.no_grad():
        for inputs, targets, lengths in test_loader:
            # Print the source sentence, skipping the <sos>/<eos> tokens
            print('>', ' '.join([
                src_vocab[int(idx)] for idx in inputs.cpu()[0].data[1:-1]
            ]))
            # Forward pass with beam search decoding
            outputs = model.predict(inputs, sos, pad, eos, max_seq_len, beam_size)
            print(' '.join([
                trg_vocab[int(idx)] for idx in outputs[0].data
            ]))
            print()

In [17]:
> at what time does it close
elle a t il y a pas de temps

> close your eyes again
ils sont ils ont peur de votre nom

> i want to believe you
c est ce que vous voulez

> you have to let me help
vous devez venir avec moi

> let s go by bus
tout le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde le monde
