%%html
<script>
function code_toggle() {
if (code_shown){
$('div.input').hide('500');
$('#toggleButton').val('Show Code')
} else {
$('div.input').show('500');
$('#toggleButton').val('Hide Code')
}
code_shown = !code_shown
}
$( document ).ready(function(){
code_shown=false;
$('div.input').hide()
});
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>
<style>
.rendered_html td {
font-size: xx-large;
text-align: left !important;
}
.rendered_html th {
font-size: xx-large;
text-align: left !important;
}
</style>
%%capture
import sys
sys.path.append("..")
import statnlpbook.util as util
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)
%load_ext tikzmagic
(Source: freeCodeCamp)
Four paradigms: rule-based, example-based, statistical, and neural machine translation
An informal but entertaining overview: "A history of machine translation from the Cold War to deep learning"
Details about SMT in previous slides from this repo
Sequence-to-sequence model (seq2seq), encoder–decoder architecture (Sutskever et al., 2014)
(Examples are Basque–English)
Many things could go wrong.
Output words depend on each other!
We added an embedding layer to the decoder
At the first timestep, we use a special start symbol (here: <S>) as input
We feed the predictions back into the next timestep:
Loss function: negative log-likelihood
Teacher forcing: always feed the ground truth into the decoder during training
Alternative: feed the model's own predictions back in, as at test time
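The two regimes can be sketched with a toy decoder. Everything here (the five-word vocabulary, the `decoder_step` function, the 0.9/0.025 probabilities) is a made-up illustration, not the lecture's actual model:

```python
import math

# Toy stand-in for an RNN decoder step: given the previous token, return
# log-probabilities over a tiny vocabulary (and pass the state through).
VOCAB = ["<S>", "I", "love", "music", "</S>"]

def decoder_step(prev_token, state):
    # Deterministic toy transition table (illustration only):
    # one word gets probability 0.9, the other four get 0.025 each.
    table = {"<S>": "I", "I": "love", "love": "music", "music": "</S>"}
    logps = {w: math.log(0.025) for w in VOCAB}
    logps[table.get(prev_token, "</S>")] = math.log(0.9)
    return logps, state

def decode_free_running(max_len=10):
    """Feed the model's own predictions back in (what we do at test time)."""
    token, state, out = "<S>", None, []
    for _ in range(max_len):
        logps, state = decoder_step(token, state)
        token = max(logps, key=logps.get)  # greedy pick
        if token == "</S>":
            break
        out.append(token)
    return out

def decode_teacher_forced(gold, state=None):
    """Teacher forcing: condition each step on the *gold* previous token
    and accumulate the log-likelihood of the gold sequence."""
    total, prev = 0.0, "<S>"
    for target in gold:
        logps, state = decoder_step(prev, state)
        total += logps[target]  # term of the (negative) log-likelihood
        prev = target           # feed the ground truth, not the prediction
    return total
```

With teacher forcing the loss decomposes into one term per gold token; free-running decoding exposes the model to its own (possibly wrong) predictions instead.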
Vocabulary size becomes a problem $\rightarrow$ we have to compute a softmax over all words in the target language!
Solution 1: restrict the vocabulary and map all remaining words to a special <UNK> symbol
Solution 2: subword tokenization (see §13.6.2 in Koehn)
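To illustrate the idea behind subword tokenization, here is a minimal byte-pair-encoding (BPE) learner in the spirit of Sennrich et al. (2016) — a sketch of the merge-learning loop, not the exact algorithm of any particular toolkit:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict.
    Starts from characters (plus an end-of-word marker) and repeatedly
    merges the most frequent adjacent symbol pair."""
    vocab = {tuple(w) + ("</w>",): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges, vocab
```

On a toy corpus like `{"low": 5, "lower": 2, "lowest": 2}`, the first merges build up frequent substrings ("lo", then "low"), so rare words decompose into known subword units instead of becoming <UNK>.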
$\rightarrow$ We don't know how long the output sequence is going to be!
Solution: a special end-of-sequence symbol: </S>, <EOS>, <END>, ...
For machine translation: decoding stops once the </S> symbol is predicted.
Recall:
Greedy decoding may lead to search errors: the returned $\y$ is not necessarily the highest-scoring global solution
With input feeding, future predictions will change based on previous ones
Exhaustive search not feasible
Popular solution:
Keep a "beam" of the best $\beta$ previous solutions
Beam size $\beta = 3$
$y_0$ | $y$ | $\log p_\theta(y \mid \x, \langle S \rangle)$
---|---|---
<S> | I | -1.670
<S> | We | -3.266
<S> | He | -3.364
<S> | She | -3.366
<S> | They | -4.920
<S> | ... | ...
Beam size $\beta = 3$
$y_0,y_1$ | $y$ | $\log p_\theta(y \mid \x, \langle S \rangle \text{ I})$ | $y_0,y_1$ | $y$ | $\log p_\theta(y \mid \x, \langle S \rangle \text{ We})$ | $y_0,y_1$ | $y$ | $\log p_\theta(y \mid \x, \langle S \rangle \text{ He})$
---|---|---|---|---|---|---|---|---
<S> I | love | -1.141 | <S> We | love | -1.129 | <S> He | loves | -2.916
<S> I | like | -1.673 | <S> We | like | -2.367 | <S> He | will | -4.267
<S> I | enjoy | -3.906 | <S> We | have | -2.904 | <S> He | likes | -4.619
<S> I | have | -4.366 | <S> We | enjoy | -4.148 | <S> He | loved | -5.698
<S> I | ... | ... | <S> We | ... | ... | <S> He | ... | ...
Beam size $\beta = 3$
$y_0,y_1$ | $y$ | $\log p_\theta(y \mid \x, \langle S \rangle \text{ I}) - 1.670$ | $y_0,y_1$ | $y$ | $\log p_\theta(y \mid \x, \langle S \rangle \text{ We}) - 3.266$ | $y_0,y_1$ | $y$ | $\log p_\theta(y \mid \x, \langle S \rangle \text{ He}) - 3.364$
---|---|---|---|---|---|---|---|---
<S> I | love | -2.811 | <S> We | love | -4.395 | <S> He | loves | -6.280
<S> I | like | -3.343 | <S> We | like | -5.633 | <S> He | will | -7.631
<S> I | enjoy | -5.576 | <S> We | have | -6.170 | <S> He | likes | -7.983
<S> I | have | -6.036 | <S> We | enjoy | -7.414 | <S> He | loved | -9.062
<S> I | ... | ... | <S> We | ... | ... | <S> He | ... | ...
because $\log p_\theta(y_0,y_1,y \mid \x) = \log p_\theta(y \mid \x, y_0, y_1) + \log p_\theta(y_0, y_1 \mid \x)$ (chain rule of probability)
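As a quick arithmetic check, the cumulative scores in the tables are just sums of log-probabilities (values taken from the tables above):

```python
# Score of the prefix "<S> I" plus the step score of "love" given that prefix
# equals the cumulative score of "<S> I love":
log_p_I = -1.670            # log p(I | x, <S>)
log_p_love_given_I = -1.141  # log p(love | x, <S> I)
assert abs((log_p_I + log_p_love_given_I) - (-2.811)) < 1e-9
```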
Beam size $\beta = 3$
$y_0,y_1$ | $y$ | $\log p_\theta(\y \mid \x)$ | $y_0,y_1$ | $y$ | $\log p_\theta(\y \mid \x)$ | $y_0,y_1$ | $y$ | $\log p_\theta(\y \mid \x)$
---|---|---|---|---|---|---|---|---
<S> I | love | -2.811 | <S> We | love | -4.395 | <S> He | loves | -6.280
<S> I | like | -3.343 | <S> We | like | -5.633 | <S> He | will | -7.631
<S> I | enjoy | -5.576 | <S> We | have | -6.170 | <S> He | likes | -7.983
<S> I | have | -6.036 | <S> We | enjoy | -7.414 | <S> He | loved | -9.062
<S> I | ... | ... | <S> We | ... | ... | <S> He | ... | ...
Continue until the </S> symbol is predicted for each hypothesis.
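The walkthrough above can be sketched as a generic beam search. The `step_fn` interface, the `toy` transition table, and its fallback scores are all hypothetical; where possible, the numbers are taken from the tables above:

```python
def beam_search(step_fn, start, beam_size=3, max_len=10, end="</S>"):
    """Generic beam search.
    step_fn(prefix) -> {token: log p(token | x, prefix)}.
    Keeps the beam_size highest-scoring prefixes; scores add up because
    log p(y_0..y_t | x) = log p(y_t | x, y_0..y_{t-1}) + log p(y_0..y_{t-1} | x).
    """
    beam = [((start,), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for token, logp in step_fn(prefix).items():
                seq = prefix + (token,)
                if token == end:
                    finished.append((seq, score + logp))
                else:
                    candidates.append((seq, score + logp))
        if not candidates:
            break
        # Prune to the beam_size best continuations.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(finished or beam, key=lambda c: c[1])

# Toy conditional log-probabilities (partly from the tables above,
# partly invented to complete the example):
toy = {
    ("<S>",): {"I": -1.670, "We": -3.266, "He": -3.364},
    ("<S>", "I"): {"love": -1.141, "like": -1.673},
    ("<S>", "We"): {"love": -1.129},
    ("<S>", "He"): {"loves": -2.916},
    ("<S>", "I", "love"): {"music": -0.3, "</S>": -2.0},
    ("<S>", "I", "love", "music"): {"</S>": -0.1},
}

def step_fn(prefix):
    # Unknown prefixes fall back to ending the sequence with a low score.
    return toy.get(prefix, {"</S>": -6.0})
```

For example, `beam_search(step_fn, "<S>")` keeps three hypotheses per step and returns the completed sequence `<S> I love music </S>` with the sum of its step log-probabilities as its score.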
We're training the model with negative log-likelihood, but that's not the best way to evaluate it.
Consider:
In machine translation, there are often several acceptable variations!
A widely used, but simplistic, metric: BLEU
BLEU scores range from 0 (no match at all) to 1.0 (perfect match, 100%)
from nltk.translate.bleu_score import sentence_bleu
refs = [["After", "lunch", ",", "he", "went", "to", "the", "gym", "."],
["He", "went", "to", "the", "gym", "after", "lunch", "."]]
sentence_bleu(refs, ["After", "lunch", ",", "he", "went", "to", "the", "gym", "."])
1.0
sentence_bleu(refs, ["Turtles", "are", "great", "animals", "to", "the", "gym", "."])
0.345720784641941
sentence_bleu(refs, ["After", "he", "had", "lunch", ",", "he", "went", "to", "the", "gym", "."])
0.6989307622784944
sentence_bleu(refs, ["After", "lunch", ",", "he", "went", "to", "the", "pizzeria", "."])
0.7506238537503395
sentence_bleu(refs, ["Before", "lunch", ",", "he", "went", "to", "the", "gym", "."])
0.8633400213704505
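Under the hood, the scores above combine modified n-gram precisions with a brevity penalty. Here is a minimal pure-Python sketch of that computation — an illustration of the formula, not NLTK's exact implementation (which also applies smoothing options, among other details):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(references, hypothesis, max_n=4):
    """Minimal BLEU: geometric mean of modified n-gram precisions,
    times a brevity penalty. Returns 0.0 if any precision is zero."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or clipped == 0:
            return 0.0
        precisions.append(clipped / total)
    # Brevity penalty: compare against the closest reference length.
    ref_len = min((abs(len(r) - len(hypothesis)), len(r)) for r in references)[1]
    bp = 1.0 if len(hypothesis) > ref_len else math.exp(1 - ref_len / len(hypothesis))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0; the clipping step is what stops a hypothesis from gaming the metric by repeating a frequent reference word.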
Beam search vs. greedy decoding
Evaluation with BLEU
Competitive machine translation models are very time-intensive to train!
Example: Wu et al. (2016) describe Google's NMT system
Encoder–decoder with attention & stack of 8 LSTM layers (plus some other additions)
36 million sentence pairs for English-to-French setting (En→Fr)
Quote:
On WMT En→Fr, it takes around 6 days to train a basic model using 96 NVIDIA K80 GPUs.
From a paper recently submitted to ICLR 2020:
We use a single transformer model to fit all the training data. We use 512 Nvidia V100 GPUs with mini-batches of approximately 1M tokens. [...] Upon the submission of this paper, training has lasted for three months, 2 epochs in total, and perplexity on the development set is still dropping.
Non-neural machine translation:
Sequence-to-sequence models:
And beyond...
# %%tikz -l arrows,positioning -s 1100,500 -sc 1 -f svg --save mt_figures/encdec_rnn1.svg
#
# \tikzset{state/.style={draw,rectangle,minimum height=1.5em,minimum width=2em,
# inner xsep=1em,inner ysep=0.5em},
# addstate/.style={draw,circle,inner sep=0.1em,fill=gray!10},
# emptystate/.style={inner sep=0.4em,text height=0.6em,text depth=0.2em},
# encembed/.style={fill=green!40!gray!40},
# decembed/.style={fill=blue!40!gray!40},
# encoder/.style={fill=green!40!gray!40},
# decoder/.style={fill=blue!40!gray!40},
# outer/.style={outer sep=0},
# label/.style={align=center,font=\bfseries\small\sffamily,text height=0.5em}}
#
# % input labels
# \foreach \i [count=\step from 1] in {Musika,maite,dut,{$<$/S$>$}} {
# \node[emptystate] (EncI\step) at (1.5*\step-1.5, 0) {\i};
# }
#
# % embedding layers
# \foreach \step in {1,...,4} {
# \node[state,encembed] (EncE\step) at (1.5*\step-1.5, 1.5) {};
# \draw[->] (EncI\step) to (EncE\step);
# }
#
# % encoder LSTMs
# \foreach \step in {1,...,4} {
# \node[state,encoder] (EncLA\step) at (1.5*\step-1.5, 2.5) {};
# \draw[->] (EncE\step) to (EncLA\step);
# \coordinate[below=0.1 of EncLA\step.east] (EncLA_be\step);
# \coordinate[below=0.1 of EncLA\step.west] (EncLA_bw\step);
# \coordinate[above=0.1 of EncLA\step.east] (EncLA_ae\step);
# \coordinate[above=0.1 of EncLA\step.west] (EncLA_aw\step);
# }
# \foreach \step in {1,...,3} {
# \pgfmathtruncatemacro{\next}{add(\step,1)}
# \draw[densely dashed, ->] (EncLA_be\step) to (EncLA_bw\next);
# \draw[densely dashed, ->] (EncLA_aw\next) to (EncLA_ae\step);
# }
#
# % encoded vectors
# \node[addstate] (EncVecA) at (6.0, 2.5) {$\oplus$};
# \draw[densely dashed, ->] (EncLA_be4) to (EncVecA);
# \draw[densely dashed, ->, rounded corners=5pt] (EncLA_aw1) -|([shift={(-5mm,3mm)}]EncLA1.north west) -- ([shift={(-10mm,4.2mm)}]EncVecA.north west) to (EncVecA.north west);
#
#
# % decoder LSTMs
# \foreach \step in {1,...,4} {
# \node[state,decoder] (DecLA\step) at (1.5*\step+6.0, 2.5) {};
# }
# \foreach \step in {1,...,3} {
# \pgfmathtruncatemacro{\next}{add(\step,1)}
# \draw[densely dashed, ->] (DecLA\step.east) to (DecLA\next.west);
# }
# \draw[densely dashed, ->] (EncVecA.east) to (DecLA1.west);
#
# % dense layer
# \foreach \step in {1,...,4} {
# \node[state,decoder] (DecD\step) at (1.5*\step+6.0, 3.5) {};
# \draw[->] (DecLA\step) to (DecD\step);
# }
#
# % output labels
# \foreach \i [count=\step from 1] in {I,love,music,{$<$/S$>$}} {
# \node[emptystate] (DecO\step) at (1.5*\step+6.0, 5.0) {\i};
# \draw[->] (DecD\step) to (DecO\step);
# }
#
# % figure labels
# \node[label,text=blue!40!black!60] (DecLabel) at (5.25, 3.75) {(Uni-)LSTM Decoder};
# \node[label,text=green!40!black!60] (EncLabel) at (1.0, 3.75) {Bi-LSTM Encoder};
# \node[label,text=black!60] (asd) at (6.0, 1.5) {Sentence\\Vector};
# %%tikz -l arrows,positioning -s 1100,500 -sc 1 -f svg --save mt_figures/encdec_rnn2.svg
#
# \tikzset{state/.style={draw,rectangle,minimum height=1.5em,minimum width=2em,
# inner xsep=1em,inner ysep=0.5em},
# addstate/.style={draw,circle,inner sep=0.1em,fill=gray!10},
# emptystate/.style={inner sep=0.4em,text height=0.6em,text depth=0.2em},
# encembed/.style={fill=green!40!gray!40},
# decembed/.style={fill=blue!40!gray!40},
# encoder/.style={fill=green!40!gray!40},
# decoder/.style={fill=blue!40!gray!40},
# outer/.style={outer sep=0},
# label/.style={align=center,font=\bfseries\small\sffamily,text height=0.5em}}
#
# % input labels
# \foreach \i [count=\step from 1] in {Musika,maite,dut,{$<$/S$>$}} {
# \node[emptystate] (EncI\step) at (1.5*\step-1.5, 0) {\i};
# }
# \foreach \i [count=\step from 1] in {{$<$S$>$},I,love,music} {
# \node[emptystate] (DecI\step) at (1.5*\step+6.0, 0) {\i};
# }
#
# % embedding layers
# \foreach \step in {1,...,4} {
# \node[state,encembed] (EncE\step) at (1.5*\step-1.5, 1.5) {};
# \node[state,decembed] (DecE\step) at (1.5*\step+6.0, 1.5) {};
# \draw[->] (EncI\step) to (EncE\step);
# \draw[->] (DecI\step) to (DecE\step);
# }
#
# % encoder LSTMs
# \foreach \step in {1,...,4} {
# \node[state,encoder] (EncLA\step) at (1.5*\step-1.5, 2.5) {};
# \draw[->] (EncE\step) to (EncLA\step);
# \coordinate[below=0.1 of EncLA\step.east] (EncLA_be\step);
# \coordinate[below=0.1 of EncLA\step.west] (EncLA_bw\step);
# \coordinate[above=0.1 of EncLA\step.east] (EncLA_ae\step);
# \coordinate[above=0.1 of EncLA\step.west] (EncLA_aw\step);
# }
# \foreach \step in {1,...,3} {
# \pgfmathtruncatemacro{\next}{add(\step,1)}
# \draw[densely dashed, ->] (EncLA_be\step) to (EncLA_bw\next);
# \draw[densely dashed, ->] (EncLA_aw\next) to (EncLA_ae\step);
# }
#
# % encoded vectors
# \node[addstate] (EncVecA) at (6.0, 2.5) {$\oplus$};
# \draw[densely dashed, ->] (EncLA_be4) to (EncVecA);
# \draw[densely dashed, ->, rounded corners=5pt] (EncLA_aw1) -|([shift={(-5mm,3mm)}]EncLA1.north west) -- ([shift={(-10mm,4.2mm)}]EncVecA.north west) to (EncVecA.north west);
#
#
# % decoder LSTMs
# \foreach \step in {1,...,4} {
# \node[state,decoder] (DecLA\step) at (1.5*\step+6.0, 2.5) {};
# \draw[->] (DecE\step) to (DecLA\step);
# }
# \foreach \step in {1,...,3} {
# \pgfmathtruncatemacro{\next}{add(\step,1)}
# \draw[densely dashed, ->] (DecLA\step.east) to (DecLA\next.west);
# }
# \draw[densely dashed, ->] (EncVecA.east) to (DecLA1.west);
#
# % dense layer
# \foreach \step in {1,...,4} {
# \node[state,decoder] (DecD\step) at (1.5*\step+6.0, 3.5) {};
# \draw[->] (DecLA\step) to (DecD\step);
# }
#
# % output labels
# \foreach \i [count=\step from 1] in {I,love,music,{$<$/S$>$}} {
# \node[emptystate] (DecO\step) at (1.5*\step+6.0, 5.0) {\i};
# \draw[->] (DecD\step) to (DecO\step);
# }
#
# % input-feeding connections
# \foreach \step in {1,...,3} {
# \pgfmathtruncatemacro{\next}{add(\step,1)}
# \node[emptystate] (Midway\step) at (1.5*\step+6.75, 3.75) {};
# \draw[densely dotted, ->, rounded corners=2pt] (DecO\step.east) -|(1.5*\step+6.75, 3.75) |-(DecI\next.west);
# }
#
# % figure labels
# \node[label,text=blue!40!black!60] (DecLabel) at (5.25, 3.75) {(Uni-)LSTM Decoder};
# \node[label,text=green!40!black!60] (EncLabel) at (1.0, 3.75) {Bi-LSTM Encoder};
# \node[label,text=black!60] (asd) at (6.0, 1.5) {Sentence\\Vector};
# %%tikz -l arrows,positioning -s 1100,500 -sc 1 -f svg --save mt_figures/encdec_rnn3.svg
#
# \tikzset{state/.style={draw,rectangle,minimum height=1.5em,minimum width=2em,
# inner xsep=1em,inner ysep=0.5em},
# addstate/.style={draw,circle,inner sep=0.1em,fill=gray!10},
# emptystate/.style={inner sep=0.4em,text height=0.6em,text depth=0.2em},
# encembed/.style={fill=green!40!gray!40},
# decembed/.style={fill=blue!40!gray!40},
# encoder/.style={fill=green!40!gray!40},
# decoder/.style={fill=blue!40!gray!40},
# outer/.style={outer sep=0},
# label/.style={align=center,font=\bfseries\small\sffamily,text height=0.5em}}
#
# % input labels
# \foreach \i [count=\step from 1] in {Musika,maite,dut,{$<$/S$>$}} {
# \node[emptystate] (EncI\step) at (1.5*\step-1.5, 0) {\i};
# }
# \foreach \i [count=\step from 1] in {{$<$S$>$},I,love,music} {
# \node[emptystate] (DecI\step) at (1.5*\step+6.0, 0) {\i};
# }
#
# % embedding layers
# \foreach \step in {1,...,4} {
# \node[state,encembed] (EncE\step) at (1.5*\step-1.5, 1.5) {};
# \node[state,decembed] (DecE\step) at (1.5*\step+6.0, 1.5) {};
# \draw[->] (EncI\step) to (EncE\step);
# \draw[->] (DecI\step) to (DecE\step);
# }
#
# % encoder LSTMs
# \foreach \step in {1,...,4} {
# \node[state,encoder] (EncLA\step) at (1.5*\step-1.5, 2.5) {};
# \draw[->] (EncE\step) to (EncLA\step);
# \coordinate[below=0.1 of EncLA\step.east] (EncLA_be\step);
# \coordinate[below=0.1 of EncLA\step.west] (EncLA_bw\step);
# \coordinate[above=0.1 of EncLA\step.east] (EncLA_ae\step);
# \coordinate[above=0.1 of EncLA\step.west] (EncLA_aw\step);
# }
# \foreach \step in {1,...,3} {
# \pgfmathtruncatemacro{\next}{add(\step,1)}
# \draw[densely dashed, ->] (EncLA_be\step) to (EncLA_bw\next);
# \draw[densely dashed, ->] (EncLA_aw\next) to (EncLA_ae\step);
# }
#
# % encoded vectors
# \node[addstate] (EncVecA) at (6.0, 2.5) {$\oplus$};
# \draw[densely dashed, ->] (EncLA_be4) to (EncVecA);
# \draw[densely dashed, ->, rounded corners=5pt] (EncLA_aw1) -|([shift={(-5mm,3mm)}]EncLA1.north west) -- ([shift={(-10mm,4.2mm)}]EncVecA.north west) to (EncVecA.north west);
#
#
# % decoder LSTMs
# \foreach \step in {1,...,4} {
# \node[state,decoder] (DecLA\step) at (1.5*\step+6.0, 2.5) {};
# \draw[->] (DecE\step) to (DecLA\step);
# }
# \foreach \step in {1,...,3} {
# \pgfmathtruncatemacro{\next}{add(\step,1)}
# \draw[densely dashed, ->] (DecLA\step.east) to (DecLA\next.west);
# }
# \draw[densely dashed, ->] (EncVecA.east) to (DecLA1.west);
#
# % dense layer
# \foreach \step in {1,...,4} {
# \node[state,decoder] (DecD\step) at (1.5*\step+6.0, 3.5) {};
# \draw[->] (DecLA\step) to (DecD\step);
# }
#
# % output labels
# \foreach \i [count=\step from 1] in {I,love,music,{$<$/S$>$}} {
# \node[emptystate] (DecO\step) at (1.5*\step+6.0, 5.0) {\i};
# \draw[->] (DecD\step) to (DecO\step);
# }
#
# % input-feeding connections
# \foreach \step in {1,...,3} {
# \pgfmathtruncatemacro{\next}{add(\step,1)}
# \node[emptystate] (Midway\step) at (1.5*\step+6.75, 3.75) {};
# \draw[densely dotted, ->, rounded corners=2pt] (DecO\step.east) -|(1.5*\step+6.75, 3.75) |-(DecI\next.west);
# }
#
# % figure labels
# \node[label,text=red!80!black!60] (asd) at (5.0, 4.5) {This is a\\bottleneck!};
# \draw[->,color=red!80!black!60] (asd) to (EncVecA);
# %%tikz -l arrows,positioning -s 1100,500 -sc 1 -f svg --save mt_figures/encdec.svg
#
# \tikzset{state/.style={draw,rectangle,minimum height=2em,minimum width=3.5em,
# inner xsep=1em,inner ysep=0.5em,text height=1em,text depth=0.15em},
# emptystate/.style={inner sep=0.4em,text height=0.6em,text depth=0.2em},
# encoder/.style={minimum width=6cm,minimum height=1.5cm,fill=green!40!gray!40,font=\bfseries\sffamily},
# decoder/.style={minimum width=6cm,minimum height=1.5cm,fill=blue!40!gray!40,font=\bfseries\sffamily},
# outer/.style={outer sep=0},
# label/.style={align=center,font=\itshape\small}}
#
# \node[emptystate] (I1) at (2.5, 0) {{Musika maite dut}};
#
# \node[state,encoder] (ENC) at (2.5, 2) {{Encoder}};
# \draw [->] (I1) to (ENC.south);
#
# \node[emptystate] (O1) at (11, 4) {I love music};
#
# \node[state,decoder] (DEC) at (11, 2) {{Decoder}};
# \draw [->] (DEC.north) to (O1);
#
# \draw [->] (ENC) to (DEC);
#
# \node[emptystate] (V) at (4.0, 4.0) {\sffamily\textit{vector representation}};
#
# \node[emptystate] (ED) at (6.75, 1.75) {};
# \draw [densely dashed] (V.east) to[out=0,in=90] (ED.north);
# %%tikz -l arrows,positioning -s 1100,500 -sc 1 -f svg --save mt_figures/encdec_att.svg
#
# \tikzset{state/.style={draw,rectangle,minimum height=1.5em,minimum width=2em,
# inner xsep=1em,inner ysep=0.5em},
# addstate/.style={draw,circle,inner sep=0.1em,fill=gray!10},
# emptystate/.style={inner sep=0.4em,text height=0.6em,text depth=0.2em},
# encembed/.style={fill=green!40!gray!40},
# decembed/.style={fill=blue!40!gray!40},
# encoder/.style={fill=green!40!gray!40},
# decoder/.style={fill=blue!40!gray!40},
# outer/.style={outer sep=0},
# attention/.style={minimum width=3cm,minimum height=1cm,text height=0.6em,text depth=0.2em,align=center,font=\bfseries\sffamily,fill=red!60!gray!60}}
#
# % input labels
# \foreach \i [count=\step from 1] in {{$<$S$>$},Musika,maite,dut,{$<$/S$>$}} {
# \node[emptystate] (EncI\step) at (1.5*\step-1.5, 1.0) {\i};
# }
#
# % embedding layer
# \foreach \step in {1,...,5} {
# \node[state,encembed] (EncE\step) at (1.5*\step-1.5, 2.0) {};
# \draw[->] (EncI\step) to (EncE\step);
# }
#
# % encoder LSTMs
# \foreach \step in {1,...,5} {
# \node[state,encoder] (EncLA\step) at (1.5*\step-1.5, 3.0) {};
# \draw[->] (EncE\step) to (EncLA\step);
# \coordinate[below=0.1 of EncLA\step.east] (EncLA_be\step);
# \coordinate[below=0.1 of EncLA\step.west] (EncLA_bw\step);
# \coordinate[above=0.1 of EncLA\step.east] (EncLA_ae\step);
# \coordinate[above=0.1 of EncLA\step.west] (EncLA_aw\step);
# }
# \foreach \step in {1,...,4} {
# \pgfmathtruncatemacro{\next}{add(\step,1)}
# \draw[densely dashed, ->] (EncLA_be\step) to (EncLA_bw\next);
# \draw[densely dashed, ->] (EncLA_aw\next) to (EncLA_ae\step);
# }
#
# % attentional model
# \node[state,attention] (Att) at (1.5, 5.5) {Attention model};
# \node[addstate] (Mult) at (5.25, 5.5) {$\times$};
# \foreach \step in {1,...,5} {
# \draw[->] (EncLA\step.north) to (Att);
# \draw[->] (EncLA\step.north) to (Mult);
# }
# \draw[densely dashed, ->] (Att) to (Mult);
#
#
# % decoder
# \node[emptystate] (DecO1) at (0.0, 11.5) {love};
# \node[emptystate] (DecO2) at (3.6, 11.5) {music};
# \foreach \i [count=\step from 1] in {I,love} {
# \node[emptystate] (DecI\step) at (3.6*\step-3.6, 7.5) {\i};
# \node[state,decembed] (DecE\step) at (3.6*\step-3.6, 8.5) {};
# \node[state,decoder] (DecL\step) at (3.6*\step-3.6, 9.5) {};
# \node[state,decoder] (DecD\step) at (3.6*\step-3.6, 10.5) {};
# }
#
# \node[emptystate] (DecI0) at (-1.5, 7.5) {$<$S$>$};
# \node[state,decembed] (DecE0) at (-1.5, 8.5) {};
# \node[state,decoder] (DecL0) at (-1.5, 9.5) {};
# \node[state,decoder] (DecD0) at (-1.5, 10.5) {};
# \node[emptystate] (DecO0) at (-1.5, 11.5) {I};
#
# \foreach \step in {0,...,2} {
# \draw[->] (DecI\step) to (DecE\step);
# \draw[->] (DecE\step) to (DecL\step);
# \draw[->] (DecL\step) to (DecD\step);
# \draw[->] (DecD\step) to (DecO\step);
# }
#
# \node[emptystate] (DecLx) at (5.1, 9.5) {};
# \draw[densely dashed, ->] (DecL0) to (DecL1);
# \draw[densely dashed, ->] (DecL1) to (DecL2);
# \draw[densely dashed, ->] (DecL2) to (DecLx);
# \draw[densely dashed, ->, rounded corners=12pt] (DecL1) -| (Att);
# \draw[densely dashed, ->, rounded corners=12pt] (Mult.north) |-([shift={(5mm,8mm)}]Att.north) |- (DecL2.west);
#
# \node[emptystate] at (0.8, 7) {};
# \node[emptystate] at (6.55, 6.85) {\sffamily context vector $z_t$};
# %%tikz -l arrows,positioning -s 1100,500 -sc 1 -f svg --save mt_figures/align.svg
#
# \tikzset{state/.style={draw,rectangle,minimum height=2em,minimum width=3.5em,
# inner xsep=1em,inner ysep=0.5em,text height=1em,text depth=0.15em},
# emptystate/.style={inner sep=0.4em,text height=0.6em,text depth=0.1em},
# label/.style={align=center,font=\itshape\small}}
#
# \node[emptystate] (I1) at (0, 0) {{Musika}};
# \node[emptystate] (I2) at (1.2, 0) {{maite}};
# \node[emptystate] (I3) at (2.1, 0) {{dut}};
#
# \node[emptystate] (O1) at (0, 2) {{I}};
# \node[emptystate] (O2) at (0.6, 2) {{love}};
# \node[emptystate] (O3) at (1.6, 2) {{music}};
#
#
# \draw [-] (I1.north) to (O3.south);
# \draw [-] (I3.north) to (O1.285);
# \draw [-] (I2.north) to (O2.south);