%%html
<script>
// Toggle visibility of all code cells in the rendered notebook.
function code_toggle() {
    if (code_shown) {
        $('div.input').hide(500);
        $('#toggleButton').val('Show Code');
    } else {
        $('div.input').show(500);
        $('#toggleButton').val('Hide Code');
    }
    code_shown = !code_shown;
}
$(document).ready(function() {
    code_shown = false;
    $('div.input').hide();
});
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>
<style>
.rendered_html td {
    font-size: xx-large;
    text-align: left !important;
}
.rendered_html th {
    font-size: xx-large;
    text-align: left !important;
}
</style>
%%capture
import sys
sys.path.append("..")  # make the statnlpbook package importable from the parent directory
import statnlpbook.util as util
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)  # default figure size for plots
%load_ext tikzmagic  # enables %%tikz cells for diagrams
from IPython.display import Image
import random
Transformers replace the whole LSTM with self-attention (Vaswani et al., 2017).
Deep multi-head self-attention encoder-decoder with sinusoidal positional encodings:
Add residual connections, layer normalization and feed-forward layers (MLPs):
Repeat this multiple times with multiple sets of parameter matrices, then concatenate:
Use hidden representation $\mathbf{h}_i$ to create three vectors: query vector $\color{purple}{\mathbf{q}_i}=W^q\mathbf{h}_i$, key vector $\color{orange}{\mathbf{k}_i}=W^k\mathbf{h}_i$, value vector $\color{blue}{\mathbf{v}_i}=W^v\mathbf{h}_i$.
$$ \mathbf{\alpha}_{i,j} = \text{softmax}\left( \frac{\color{purple}{\mathbf{q}_i}^\intercal \color{orange}{\mathbf{k}_j}} {\sqrt{d_{\mathbf{h}}}} \right) \\ \mathbf{h}_i^\prime = \sum_{j=1}^n \mathbf{\alpha}_{i,j} \color{blue}{\mathbf{v}_j} $$
$W^q$, $W^k$ and $W^v$ are all trained; the softmax normalizes over all positions $j$.
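Below is a minimal NumPy sketch of one self-attention head. The sequence length, hidden size, and the matrices $W^q$, $W^k$, $W^v$ are random toy stand-ins for trained parameters:

```python
import numpy as np

def self_attention(H, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of hidden vectors H (n x d)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv            # query, key, value vectors per token
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # scaled dot products q_i . k_j
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)  # softmax over positions j
    return alpha @ V                            # h'_i = sum_j alpha_ij v_j

rng = np.random.default_rng(0)
n, d = 5, 16                                    # toy sequence length and hidden size
H = rng.normal(size=(n, d))                     # stand-in hidden representations
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
H_prime = self_attention(H, Wq, Wk, Wv)         # shape (5, 16): one new vector per token
```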
In matrix form:
$$ \text{Attention}(Q,K,V)= \text{softmax}\left( \frac{\color{purple}{Q} \color{orange}{K}^\intercal} {\sqrt{d_{\mathbf{h}}}} \right) \color{blue}{V} $$
With $h$ heads, each with its own projection matrices:
$$ \text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\ldots,\text{head}_h)W^O $$
where
$$ \text{head}_i=\text{Attention}(QW_i^q,KW_i^k,VW_i^v) $$
Repeat this for multiple layers, each using the previous as input:
$$ \text{MultiHead}^\ell(Q^\ell,K^\ell,V^\ell)=\text{Concat}(\text{head}_1^\ell,\ldots,\text{head}_h^\ell)W_\ell^O $$
where
$$ \text{head}_i^\ell=\text{Attention}(Q^\ell W_{i,\ell}^q,K^\ell W_{i,\ell}^k,V^\ell W_{i,\ell}^v) $$
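A sketch of one multi-head layer under the same toy assumptions, reusing `self_attention` from above (real models train a separate parameter set per head and per layer):

```python
def multi_head(H, heads, Wo):
    """One multi-head layer; heads is a list of (Wq, Wk, Wv) triples."""
    # run every head independently, then concatenate along the feature axis
    concat = np.concatenate(
        [self_attention(H, Wq, Wk, Wv) for Wq, Wk, Wv in heads], axis=-1)
    return concat @ Wo                           # output projection W^O

num_heads, d_head = 4, 4                         # 4 heads of size 4 -> concat dim 16
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3))
         for _ in range(num_heads)]
Wo = rng.normal(size=(num_heads * d_head, d))
layer1 = multi_head(H, heads, Wo)                # first layer
layer2 = multi_head(layer1, heads, Wo)           # next layer consumes the previous output
                                                 # (in practice each layer has fresh parameters)
```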
RNNs process tokens sequentially, but Transformers process all tokens at once.
In fact, we did not even provide any information about the order of tokens...
Represent positions with fixed-length vectors, with the same dimensionality as word embeddings:
(1st position, 2nd position, 3rd position, ...) $\to$ must decide on a maximum sequence length
Add to word embeddings at the input layer:
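A small sketch of the sinusoidal encodings of Vaswani et al. (2017), reusing `n`, `d` and `rng` from the sketches above; `d` must be even here:

```python
def positional_encodings(n, d):
    """Fixed sinusoidal position vectors: shape (n, d), one row per position."""
    pos = np.arange(n)[:, None]                  # positions 0..n-1
    i = np.arange(d // 2)[None, :]               # indices of dimension pairs
    angles = pos / (10000 ** (2 * i / d))
    enc = np.zeros((n, d))
    enc[:, 0::2] = np.sin(angles)                # even dimensions: sine
    enc[:, 1::2] = np.cos(angles)                # odd dimensions: cosine
    return enc

embeddings = rng.normal(size=(n, d))             # stand-in word embeddings
inputs = embeddings + positional_encodings(n, d) # added at the input layer
```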
Alternatives:
Model | Accuracy (%) |
---|---|
LSTM | 77.6 |
LSTMs with conditional encoding | 80.9 |
LSTMs with conditional encoding + attention | 82.3 |
LSTMs with word-by-word attention | 83.5 |
Self-attention | 85.6 |
The decoder attends to the encoded input and to the partial output generated so far.
The encoder transformer is sometimes called a "bidirectional transformer".
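One standard way to restrict attention to the partial output is a causal mask on the scores; below is a sketch written as a variant of `self_attention` above (the masking scheme is standard, the function itself is ours):

```python
def masked_self_attention(H, Wq, Wk, Wv):
    """Decoder-style self-attention: position i may only attend to j <= i."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)  # entries with j > i
    scores = np.where(future, -np.inf, scores)  # future tokens get zero attention weight
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return alpha @ V
```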
Predict masked words given context on both sides:
Conditional encoding of both sentences:
Transformer with $L$ layers of dimension $H$, and $A$ self-attention heads.
(Many other variations available through HuggingFace Transformers)
Trained on 16GB of text from Wikipedia + BookCorpus.
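As an illustration of masked-word prediction with a pretrained BERT via the HuggingFace Transformers library mentioned above (downloads the bert-base-uncased weights on first use; the example sentence is ours):

```python
from transformers import pipeline

# BERT's masked-LM head: predict the [MASK] token from context on both sides
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The capital of Denmark is [MASK]."):
    print(f'{pred["token_str"]:>12}  {pred["score"]:.3f}')
```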
Model | Accuracy (%) |
---|---|
LSTM | 77.6 |
LSTMs with conditional encoding | 80.9 |
LSTMs with conditional encoding + attention | 82.3 |
LSTMs with word-by-word attention | 83.5 |
Self-attention | 85.6 |
BERT$_\mathrm{BASE}$ | 89.2 |
BERT$_\mathrm{LARGE}$ | 90.4 |
Same architecture as BERT, but with better hyperparameter tuning, more training data (Liu et al., 2019), and no next-sentence-prediction task (only masked LM).
Training: 1024 GPUs for one day.
Model | Accuracy (%) |
---|---|
LSTM | 77.6 |
LSTMs with conditional encoding | 80.9 |
LSTMs with conditional encoding + attention | 82.3 |
LSTMs with word-by-word attention | 83.5 |
Self-attention | 85.6 |
BERT$_\mathrm{BASE}$ | 89.2 |
BERT$_\mathrm{LARGE}$ | 90.4 |
RoBERTa$_\mathrm{BASE}$ | 90.7 |
RoBERTa$_\mathrm{LARGE}$ | 91.4 |
Multilingual BERT: https://github.com/google-research/bert/blob/master/multilingual.md
(Language-specific models also exist: CamemBERT for French, BERTje for Dutch, Nordic BERT, ...)