%matplotlib inline
import d2l
import math
from mxnet import autograd, gluon, init, np, npx
npx.set_np()
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
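As a quick sanity check (a small sketch, not part of the original pipeline), each mini-batch drawn from train_iter consists of an input and a label tensor, both of shape (batch size, number of time steps):
for X, Y in train_iter:
    print(X.shape, Y.shape)  # expect (32, 35) (32, 35)
    break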
One-hot encoding: map each token to a unique unit vector.
npx.one_hot(np.array([0, 2]), len(vocab))
array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
We transpose the mini-batch before one-hot encoding, mapping a (batch size, time step) mini-batch to a (time step, batch size, vocabulary size) tensor, so that iterating over the first axis yields one time step at a time.
X = np.arange(10).reshape((2, 5))
npx.one_hot(X.T, 28).shape
(5, 2, 28)
Initializing the model parameters
def get_params(vocab_size, num_hiddens, ctx):
    num_inputs = num_outputs = vocab_size
    normal = lambda shape: np.random.normal(scale=0.01, size=shape, ctx=ctx)
    # Hidden layer parameters
    W_xh = normal((num_inputs, num_hiddens))
    W_hh = normal((num_hiddens, num_hiddens))
    b_h = np.zeros(num_hiddens, ctx=ctx)
    # Output layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = np.zeros(num_outputs, ctx=ctx)
    # Attach a gradient to each parameter
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.attach_grad()
    return params
Initialize the hidden state
def init_rnn_state(batch_size, num_hiddens, ctx):
    return (np.zeros(shape=(batch_size, num_hiddens), ctx=ctx), )
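Returning the state as a tuple keeps the interface uniform for models whose state consists of several tensors. A quick shape check (a sketch with illustrative values):
init_rnn_state(2, 512, d2l.try_gpu())[0].shape  # (2, 512)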
The forward function computes the hidden state and output at each time step
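In equation form, the loop below computes, for each time step $t$,
$$\mathbf{H}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h), \qquad \mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q.$$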
def rnn(inputs, state, params):
    # inputs shape: (num_steps, batch_size, vocab_size)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        H = np.tanh(np.dot(X, W_xh) + np.dot(H, W_hh) + b_h)
        Y = np.dot(H, W_hq) + b_q
        outputs.append(Y)
    return np.concatenate(outputs, axis=0), (H,)
Wrap these functions and store parameters
class RNNModelScratch(object):
    def __init__(self, vocab_size, num_hiddens, ctx,
                 get_params, init_state, forward):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.params = get_params(vocab_size, num_hiddens, ctx)
        self.init_state, self.forward_fn = init_state, forward

    def __call__(self, X, state):
        X = npx.one_hot(X.T, self.vocab_size)
        return self.forward_fn(X, state, self.params)

    def begin_state(self, batch_size, ctx):
        return self.init_state(batch_size, self.num_hiddens, ctx)
Check whether the inputs and outputs have the correct dimensions: because the per-step outputs are concatenated along the first axis, the output should have shape (number of time steps × batch size, vocabulary size).
vocab_size, num_hiddens, ctx = len(vocab), 512, d2l.try_gpu()
model = RNNModelScratch(len(vocab), num_hiddens, ctx, get_params,
                        init_rnn_state, rnn)
state = model.begin_state(X.shape[0], ctx)
Y, new_state = model(X.as_in_context(ctx), state)
Y.shape, len(new_state), new_state[0].shape
((10, 28), 1, (2, 512))
Predicting the next num_predicts characters
def predict_ch8(prefix, num_predicts, model, vocab, ctx):
    state = model.begin_state(batch_size=1, ctx=ctx)
    outputs = [vocab[prefix[0]]]
    get_input = lambda: np.array([outputs[-1]], ctx=ctx).reshape((1, 1))
    for y in prefix[1:]:  # Warm up the state with the prefix
        _, state = model(get_input(), state)
        outputs.append(vocab[y])
    for _ in range(num_predicts):  # Predict num_predicts steps
        Y, state = model(get_input(), state)
        outputs.append(int(Y.argmax(axis=1).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])
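Since the model has not been trained yet, we can already run the predictor, but it produces gibberish, e.g. (a sketch; the output varies from run to run):
predict_ch8('time traveller ', 10, model, vocab, ctx)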
Gradient clipping:

$$\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}.$$

def grad_clipping(model, theta):
    if isinstance(model, gluon.Block):
        params = [p.data() for p in model.collect_params().values()]
    else:
        params = model.params
    norm = math.sqrt(sum((p.grad ** 2).sum() for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
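To see the clipping in action, here is a minimal sketch. The Toy class is a hypothetical stand-in that only mimics the params attribute grad_clipping expects from our scratch model; after clipping with theta = 1, the global gradient norm is at most 1.
class Toy:  # hypothetical container exposing only a `params` list
    def __init__(self, ctx):
        w = np.random.normal(scale=10, size=(4,), ctx=ctx)
        w.attach_grad()
        self.params = [w]

toy = Toy(ctx)
with autograd.record():
    obj = (toy.params[0] ** 2).sum()
obj.backward()
grad_clipping(toy, 1)
math.sqrt(float(sum((p.grad ** 2).sum() for p in toy.params)))  # <= 1.0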
Training one epoch
def train_epoch_ch8(model, train_iter, loss, updater, ctx):
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2)  # loss_sum, num_examples
    for X, Y in train_iter:
        if not state:
            state = model.begin_state(batch_size=X.shape[0], ctx=ctx)
        y = Y.T.reshape((-1,))
        X, y = X.as_in_context(ctx), y.as_in_context(ctx)
        with autograd.record():
            py, state = model(X, state)
            l = loss(py, y).mean()
        l.backward()
        grad_clipping(model, 1)
        updater(batch_size=1)  # batch_size=1 since the loss is already the mean
        metric.add(l * y.size, y.size)
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()
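The first returned value is the perplexity, the exponential of the average cross-entropy per token,
$$\exp\left(\frac{1}{n}\sum_{t=1}^{n} -\log P(x_t \mid x_{t-1}, \ldots, x_1)\right),$$
which is exactly what math.exp(metric[0] / metric[1]) computes, since metric accumulates the total loss and the number of tokens.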
The training function
def train_ch8(model, train_iter, vocab, lr, num_epochs, ctx):
    # Initialize
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
                            legend=['train'], xlim=[1, num_epochs])
    if isinstance(model, gluon.Block):
        model.initialize(ctx=ctx, force_reinit=True, init=init.Normal(0.01))
        trainer = gluon.Trainer(model.collect_params(),
                                'sgd', {'learning_rate': lr})
        updater = lambda batch_size: trainer.step(batch_size)
    else:
        updater = lambda batch_size: d2l.sgd(model.params, lr, batch_size)
    predict = lambda prefix: predict_ch8(prefix, 50, model, vocab, ctx)
    # Train and check the progress.
    for epoch in range(num_epochs):
        ppl, speed = train_epoch_ch8(model, train_iter, loss, updater, ctx)
        if epoch % 10 == 0:
            print(predict('time traveller'))
            animator.add(epoch + 1, [ppl])
    print('Perplexity %.1f, %d tokens/sec on %s' % (ppl, speed, ctx))
    print(predict('time traveller'))
    print(predict('traveller'))
Finally, we can train the model.
num_epochs, lr = 500, 1
train_ch8(model, train_iter, vocab, lr, num_epochs, ctx)
Perplexity 1.0, 32620 tokens/sec on gpu(0)
time traveller for so it will be convenient to speak of him was
traveller it s against reason said filby what reason said