require 'nn'
torch.manualSeed(287)
$\newcommand{\reals}{\mathbb{R}}$ $\newcommand{\boldx}{\mathbf{x}}$ $\newcommand{\boldw}{\mathbf{w}}$
We will use nn.CMul as our first example of a module.
f = nn.CMul(size)
creates a module that computes the function $f: \reals^{size} \rightarrow \reals^{size}$ defined by $f(\boldx) = \boldx \odot \boldw$, where $\odot$ is elementwise multiplication. $\boldw$ are the function's parameters, which 'nn' will automatically initialize to something reasonable.
x = torch.range(1, 5) -- will input this into the module
f = nn.CMul(x:size()) -- create the module
-- let's see what f's parameters were initialized to. ('nn' always inits to something reasonable)
print(f.weight)
print()
-- to apply f to an input x we call f:forward(x)
print(f:forward(x))
-0.3180 -0.0484 0.1885 0.4165 0.3364 [torch.DoubleTensor of size 5] -0.3180 -0.0968 0.5655 1.6661 1.6822 [torch.DoubleTensor of size 5]
-- modules are stateful; they store their parameters (if any) and also their last output
print(f.output) -- N.B. every module has an 'output' member
-0.3180 -0.0968 0.5655 1.6661 1.6822 [torch.DoubleTensor of size 5]
Let's create another simple module.
g = nn.Sum(j)
creates a module computing the function $g: \reals^{D_1 \times \ldots \times \, D_j \, \times \ldots \times \, D_M} \rightarrow \reals^{D_1 \times \ldots \times \, D_{j-1} \, \times \, D_{j+1} \, \times \ldots \times \, D_M}$ that sums the input over dimension $j$ (thus decreasing the number of dimensions by 1).
g = nn.Sum(1) -- sum over dimension 1
print(g:forward(x))
15 [torch.DoubleTensor of size 1]
Most modules allow batching of inputs along the first dimension. That is, if your module expects inputs $x \in \reals^{size}$, you can give it an input $X \in \reals^{N \times size}$, and it will apply itself to each $x$ along the first dimension of $X$.
-- let's batch calls to f
X = x:view(1,5):expand(3, 5) -- here, N = 3
print(f:forward(X))
-- whenever you can, you should batch; it'll be much faster
-0.3180 -0.0968 0.5655 1.6661 1.6822 -0.3180 -0.0968 0.5655 1.6661 1.6822 -0.3180 -0.0968 0.5655 1.6661 1.6822 [torch.DoubleTensor of size 3x5]
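The batching behaviour above can be checked directly. The following is a minimal sketch, assuming a fresh nn.CMul over a 5-element input; the local names (single, batched) are illustrative only. It also shows why the single-example output is cloned first: a module reuses its 'output' buffer, so a later :forward() call overwrites it.

```lua
require 'nn'
torch.manualSeed(287)

local f = nn.CMul(5)
local x = torch.range(1, 5)

-- clone, because f.output is overwritten by the next call to :forward()
local single = f:forward(x):clone()

-- batch of N = 3 copies of x
local X = x:view(1, 5):expand(3, 5)
local batched = f:forward(X)

-- every row of the batched output should equal the single-example output
for n = 1, 3 do
    assert((batched[n] - single):abs():max() < 1e-12)
end
print(batched:size())
```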
Individual modules can be combined using 'Container' modules to compute more complicated functions. For instance, the modules g and f can be composed to get g(f()) using
nn.Sequential()
h = nn.Sequential() -- this module computes the function defined by composing its child modules' functions in order
h:add(f) -- add the module f as h's first child
h:add(g) -- add the module g as h's second child
print(h:forward(x)) -- computes g(f(x)) = sum_i [ x \odot w ]_i, where \odot is elementwise multiplication
3.4990 [torch.DoubleTensor of size 1]
-- though nn.Sequential is the container you'll use most (at least early in the course), there are others.
-- nn.Concat(j) is a container that computes the function defined by applying each of its child modules to a single
-- input, and then concatenating the respective outputs along dimension j
cat = nn.Concat(1) -- concatenate outputs along 1st dimension
cat:add(f)
cat:add(g)
print(cat:forward(x))
-0.3180 -0.0968 0.5655 1.6661 1.6822 15.0000 [torch.DoubleTensor of size 6]
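As a sanity check on what nn.Concat computes, one can compare it against concatenating the children's outputs by hand with torch.cat. This is a sketch under the same setup as above (a CMul and a Sum applied to a 5-element vector); the variable names are illustrative.

```lua
require 'nn'
torch.manualSeed(287)

local f = nn.CMul(5)
local g = nn.Sum(1)
local x = torch.range(1, 5)

local cat = nn.Concat(1):add(f):add(g)
local out = cat:forward(x)

-- apply each child separately and concatenate along dimension 1
local manual = torch.cat(f:forward(x), g:forward(x), 1)

assert(out:size(1) == 6)
assert((out - manual):abs():max() < 1e-12)
```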
-- You can print a module to see its contents
print(h)
nn.Sequential { [input -> (1) -> (2) -> output] (1): nn.CMul (2): nn.Sum } { gradInput : DoubleTensor - empty modules : { 1 : nn.CMul { output : DoubleTensor - size: 5 gradInput : DoubleTensor - empty _output : DoubleTensor - size: 5 _repeat : DoubleTensor - empty _expand : DoubleTensor - size: 3x5 gradWeight : DoubleTensor - size: 5 _weight : DoubleTensor - size: 5 size : LongStorage - size: 1 weight : DoubleTensor - size: 5 } 2 : nn.Sum { gradInput : DoubleTensor - empty
dimension : 1 output : DoubleTensor - size: 1 } } output : DoubleTensor - size: 1 }
-- to access the children of containers you can use :get(i) or index into a list of children returned by .modules
print(h:get(1))
print(h.modules[1])
nn.CMul { output : DoubleTensor - size: 5 gradInput : DoubleTensor - empty _output : DoubleTensor - size: 5 _repeat : DoubleTensor - empty _expand : DoubleTensor - size: 3x5 gradWeight : DoubleTensor - size: 5 _weight : DoubleTensor - size: 5 size : LongStorage - size: 1 weight : DoubleTensor - size: 5 } nn.CMul { output : DoubleTensor - size: 5 gradInput : DoubleTensor - empty _output : DoubleTensor - size: 5 _repeat : DoubleTensor - empty _expand : DoubleTensor - size: 3x5 gradWeight : DoubleTensor - size: 5 _weight : DoubleTensor - size: 5 size : LongStorage - size: 1 weight : DoubleTensor - size: 5 }
$\newcommand{\boldz}{\mathbf{z}}$ $\newcommand{\btheta}{\boldsymbol{\theta}}$
Suppose we have a module computing a function $h$ that participates in the definition of a loss function $L$. For a particular input $\boldx \in \reals^n$ let $\boldz \in \reals^m$ be defined by $\boldz = h(\boldx)$, which allows us to write our loss function as $L(\boldz)$. By the (multivariate) chain rule, the gradient of $L$ wrt $x_i$ is
\begin{align*} \frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial z_j} \frac{\partial z_j}{\partial x_i} \end{align*}Assuming $L$ returns a scalar, we can rewrite the above for the entire $\boldx$ as
\begin{align*} \frac{\partial L}{\partial \boldx} = \left(\frac{\partial L}{\partial \boldz}\right)^T \frac{\partial \boldz}{\partial \boldx}, \end{align*}where $\frac{\partial \boldz}{\partial \boldx}$ is the Jacobian, which lives in $\reals^{m \times n}$.
Each 'nn' module knows how to (implicitly) compute $\frac{\partial \boldz}{\partial \boldx}$ -- the gradient of its output wrt its input -- and so can compute $\frac{\partial L}{\partial \boldx}$ if it is also handed $\frac{\partial L}{\partial \boldz}$. In just the same way, if a module has parameters $\btheta$, it knows how to calculate $\frac{\partial \boldz}{\partial \btheta}$, and can therefore calculate $\frac{\partial L}{\partial \btheta}$ if it is handed $\frac{\partial L}{\partial \boldz}$.
It's very important to know the 'nn' terminology for these gradients:
$\frac{\partial L}{\partial \boldx}$ is called 'gradInput' in nn; it's the gradient of the loss wrt a module's input
$\frac{\partial L}{\partial \boldz}$ is called 'gradOutput' in nn; it's the gradient of the loss wrt a module's output
$\frac{\partial L}{\partial \btheta}$ is called either 'gradWeight' or 'gradBias' in nn; it's the gradient of the loss wrt a module's parameters
Given $\frac{\partial L}{\partial \boldz}$ (i.e., a gradOutput), an 'nn' module computes $\frac{\partial L}{\partial \boldx}$ with the :backward() function, and stores it in its 'gradInput' member, as follows:
gradOut = torch.randn(1) -- let's make up a random gradOutput for dL/dz (of same dimension as output of h)
print(gradOut)
-- let's now compute dL/dx with our gradOut
-- N.B. you MUST call :forward() before :backward() (and provide the same input); note we called :forward() above
h:backward(x, gradOut)
print(h.gradInput) -- N.B. each module also must have a gradInput member
1.6834 [torch.DoubleTensor of size 1] -0.5353 -0.0815 0.3173 0.7012 0.5664 [torch.DoubleTensor of size 5]
Let's check the gradients :backward() computed. Recall $h = g(f(\boldx))$ = nn.Sequential():add(f):add(g). Because $h$ is an nn.Sequential/composition, it first computes the gradient of the loss wrt the input of $g$, which is $f(\boldx)$. So, we get \begin{align*} \frac{\partial L}{\partial f(\boldx)} = \frac{\partial L}{\partial h(\boldx)} \cdot \frac{\partial h(\boldx)}{\partial f(\boldx)} = gradOut \cdot g'(f(\boldx)) \end{align*}
Since $g$ just sums, $g'(\boldx)_i = 1$ for each $i$, and so $\frac{\partial L}{\partial f(\boldx)} = gradOut \cdot$ torch.ones(x:size())
print(g.gradInput)
assert((g.gradInput - torch.ones(x:size()):mul(gradOut[1])):abs():max() < 1e-10)
1.6834 1.6834 1.6834 1.6834 1.6834 [torch.DoubleTensor of size 5]
Now that we have $\frac{\partial L}{\partial f(\boldx)}$, we can calculate $\frac{\partial L}{\partial \boldx}$ as $(\frac{\partial L}{\partial f(\boldx)})^T \frac{\partial f}{\partial \boldx}$. Since $f(\boldx)_i = w_i \cdot x_i$, we have $\frac{\partial f_i}{\partial x_i} = w_i$ and $\frac{\partial f_i}{\partial x_j} = 0$ for $j \neq i$, so the Jacobian $\frac{\partial f}{\partial \boldx}$ is $diag(\boldw)$. Thus, $\frac{\partial L}{\partial \boldx}$ is g.gradInput$^T diag(\boldw) =$ g.gradInput $\odot \boldw$.
assert((f.gradInput - torch.cmul(g.gradInput, f.weight)):abs():max() < 1e-10)
-- (Note that f.gradInput == h.gradInput, since h = g(f(x)))
In addition to computing $\frac{\partial L}{\partial \boldx}$, backward() also computes $\frac{\partial L}{\partial \btheta}$, where $\btheta$ are the module's parameters. Specifically, modules accumulate the gradients wrt their parameters in their 'gradWeight' and 'gradBias' members. So, let's redo the above example, this time paying attention to parameters.
-- since backward() accumulates (i.e., adds) gradients, we need to start by zeroing out gradWeight and gradBias
h:zeroGradParameters() -- N.B. calling zeroGradParameters() on a container recursively zeroes grads on children
h:backward(x, gradOut)
print(f.gradWeight)
-- let's check that gradient was correct, using a calculation similar to the one used above for dL/dx
assert((f.gradWeight - torch.cmul(g.gradInput, x)):abs():max() < 1e-10)
1.6834 3.3668 5.0503 6.7337 8.4171 [torch.DoubleTensor of size 5]
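The accumulation behaviour is easy to miss, so here is a small sketch that makes it explicit, using nn.Linear so that both 'gradWeight' and 'gradBias' appear. It assumes a single (unbatched) input, for which $\frac{\partial L}{\partial \mathbf{W}}$ is the outer product of gradOutput and the input, and $\frac{\partial L}{\partial \mathbf{b}}$ is gradOutput itself. The local names are illustrative.

```lua
require 'nn'
torch.manualSeed(287)

local lin = nn.Linear(5, 3)
local x = torch.range(1, 5)
local gradOut = torch.randn(3) -- made-up dL/dz

lin:forward(x)
lin:zeroGradParameters()
lin:backward(x, gradOut)
local gW1 = lin.gradWeight:clone()
local gb1 = lin.gradBias:clone()

-- for z = Wx + b: dL/dW = gradOut x^T (outer product), dL/db = gradOut
assert((gW1 - torch.ger(gradOut, x)):abs():max() < 1e-10)
assert((gb1 - gradOut):abs():max() < 1e-10)

-- a second :backward() WITHOUT zeroing adds to the stored gradients
lin:backward(x, gradOut)
assert((lin.gradWeight - gW1:clone():mul(2)):abs():max() < 1e-10)
assert((lin.gradBias - gb1:clone():mul(2)):abs():max() < 1e-10)
```

Forgetting to zero the gradients between updates therefore silently sums gradients across steps.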
Now that we know about :forward() and :backward() let's get a more precise sense of how they work. You'll need to know this if you ever want to implement your own modules!
-- Recall that Module is an abstract class. The (abstract) functions :forward() and :backward() are defined in terms
-- of 3 functions subclasses must implement: updateOutput(), updateGradInput(), accGradParameters()
-- The below code is from https://github.com/torch/nn/blob/master/Module.lua; the comments are mine
function Module:forward(input)
    return self:updateOutput(input) -- subclasses must implement updateOutput, which sets self.output
end
function Module:backward(input, gradOutput, scale)
    scale = scale or 1
    self:updateGradInput(input, gradOutput) -- subclasses must implement updateGradInput, which sets self.gradInput
    self:accGradParameters(input, gradOutput, scale) -- subclasses must add dL/d\theta to self.gradWeight etc
    return self.gradInput
end
-- here are some very simplified versions of these 3 functions for the CMul module (with new comments),
-- adapted from https://github.com/torch/nn/blob/master/CMul.lua
-- N.B. CMul inherits from Module, and so has .output and .gradInput members;
-- because it has parameters, it also has .weight and .gradWeight members
function CMul:updateOutput(input)
    self.output:resizeAs(input):copy(input) -- self.output = input
    self.output:cmul(self.weight) -- self.output = self.output .* self.weight
    return self.output
end
function CMul:updateGradInput(input, gradOutput)
    self.gradInput:resizeAs(input):zero() -- zero out our gradInput storer
    self.gradInput:addcmul(1, self.weight, gradOutput) -- self.gradInput = gradOutput .* self.weight
    return self.gradInput
end
function CMul:accGradParameters(input, gradOutput, scale)
    scale = scale or 1
    -- don't zero out gradWeight, because we're accumulating!
    self.gradWeight:addcmul(scale, input, gradOutput) -- self.gradWeight = self.gradWeight + scale * (gradOutput .* input)
end
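To make the recipe concrete, here is a minimal sketch of a custom parameter-free module that squares its input elementwise. The name nn.MySquare is hypothetical and chosen only for illustration ('nn' already ships its own nn.Square). Because the module has no parameters, it relies on the default accGradParameters() inherited from Module, which does nothing.

```lua
require 'nn'

-- hypothetical module name, for illustration only
local MySquare, parent = torch.class('nn.MySquare', 'nn.Module')

function MySquare:__init()
    parent.__init(self) -- creates self.output and self.gradInput
end

function MySquare:updateOutput(input)
    -- z_i = x_i^2
    self.output:resizeAs(input):copy(input):cmul(input)
    return self.output
end

function MySquare:updateGradInput(input, gradOutput)
    -- dL/dx_i = dL/dz_i * 2 x_i
    self.gradInput:resizeAs(input):copy(input):mul(2):cmul(gradOutput)
    return self.gradInput
end

local sq = nn.MySquare()
local x = torch.range(1, 5)
local out = sq:forward(x)                  -- 1 4 9 16 25
local gin = sq:backward(x, torch.ones(5))  -- 2 4 6 8 10
print(out)
print(gin)
```

Since :forward() and :backward() are defined in Module in terms of these functions, the new module can be added to an nn.Sequential like any built-in one.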
-- In addition to nn.CMul, you will likely want to know about
lin = nn.Linear(x:size(1), 3) -- computes Wx + b, where W \in R^{3 x 5} and b \in R^3
print(lin:forward(x))
-2.8031 1.9383 -1.8525 [torch.DoubleTensor of size 3]
-- LookupTables will be extremely important for this course; they map indices to corresponding weight vectors
LT = nn.LookupTable(5, 3) -- maps indices (1 thru 5) to corresponding weight vectors, which live in R^3
-- let's look at a LookupTable's weights
print(LT.weight)
-- LookupTables take indices as input!
idxs = torch.LongTensor({1,2,5})
print(LT:forward(idxs)) -- extracts 1st, 2nd, and 5th rows of weights
-- can also batch input to a LookupTable, as follows
batchIdxs = torch.LongTensor({{1, 3}, {4, 5}, {2, 3}}) -- here, there are 3 examples, each associated with 2 idxs
print(LT:forward(batchIdxs))
-0.3408 0.8594 1.2139 0.1566 -0.5897 0.1788 -0.0315 0.1821 -0.7354 0.1246 0.2621 -0.0320 1.1030 -0.9798 0.3587 [torch.DoubleTensor of size 5x3] -0.3408 0.8594 1.2139 0.1566 -0.5897 0.1788 1.1030 -0.9798 0.3587 [torch.DoubleTensor of size 3x3] (1,.,.) = -0.3408 0.8594 1.2139 -0.0315 0.1821 -0.7354 (2,.,.) = 0.1246 0.2621 -0.0320 1.1030 -0.9798 0.3587 (3,.,.) = 0.1566 -0.5897 0.1788 -0.0315 0.1821 -0.7354 [torch.DoubleTensor of size 3x2x3]
-- nn.Add computes x + b, where b \in R^5 (though it can also be used to add a single scalar constant)
add = nn.Add(x:size())
print(add:forward(x))
-- there are many more (esp. convolutions, which we'll talk about later in the course)!
1.4369 2.2328 2.5575 4.2737 4.6302 [torch.DoubleTensor of size 5]
-- non-linearities/'transfer' functions
x = torch.randn(5)
nonlin1 = nn.Sigmoid()
nonlin2 = nn.LogSoftMax()
nonlin3 = nn.Tanh()
nonlin4 = nn.ReLU()
print(nonlin1:forward(x))
print(nonlin2:forward(x))
print(nonlin3:forward(x))
print(nonlin4:forward(x))
-- other mathematical operations
X = torch.randn(3, 2)
op1 = nn.Max(1, 2) -- maxes over dimension 1, expects 2d input
op2 = nn.Mean(2, 2) -- means over dimension 2, expects 2d input
op3 = nn.Abs()
print(op1:forward(X))
print(op2:forward(X))
print(op3:forward(X))
-- there are also Modules that reshape or re-view their arguments; the one you'll use most often is nn.View,
-- which takes in the desired dimension sizes
print(nn.View(2,3):forward(X))
-- there are many more!
0.6790 0.5264 0.6779 0.6240 0.5938 [torch.DoubleTensor of size 5]
-1.3853 -2.0289 -1.3905 -1.6280 -1.7545 [torch.DoubleTensor of size 5] 0.6346 0.1051 0.6315 0.4672 0.3626 [torch.DoubleTensor of size 5] 0.7491 0.1055 0.7440 0.5065 0.3799 [torch.DoubleTensor of size 5] 0.9011 2.5226 [torch.DoubleTensor of size 2] 0.4434 0.2421 1.2315 [torch.DoubleTensor of size 3] 0.9011 0.0143 0.8167 1.3009 0.0597 2.5226 [torch.DoubleTensor of size 3x2]
0.9011 -0.0143 -0.8167 1.3009 -0.0597 2.5226 [torch.DoubleTensor of size 2x3]
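A few of the transfer functions above can be checked against their definitions. The sketch below assumes a random 5-element input and verifies that nn.Sigmoid matches $1/(1+e^{-x})$, that exponentiating the nn.LogSoftMax output gives probabilities summing to 1, and that nn.ReLU output is non-negative. The local names are illustrative.

```lua
require 'nn'
torch.manualSeed(287)

local x = torch.randn(5)

-- sigmoid(x) = 1 / (1 + exp(-x)), computed by hand with in-place ops on a copy
local manualSig = x:clone():mul(-1):exp():add(1):pow(-1)
local sig = nn.Sigmoid():forward(x)
assert((sig - manualSig):abs():max() < 1e-10)

-- LogSoftMax returns log-probabilities, so exp of them sums to 1
local logp = nn.LogSoftMax():forward(x)
local total = torch.exp(logp):sum()
assert(math.abs(total - 1) < 1e-10)

-- ReLU(x) = max(0, x)
local relu = nn.ReLU():forward(x)
assert(relu:min() >= 0)
```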
All the containers (and other modules) we've seen so far take in single Tensors as arguments. This won't be sufficient if we want functions of multiple inputs (especially if they're of different sizes or types).
As a motivating example, suppose we want to make a Linear-like layer over both sparse and dense features. That is, we want to compute
\begin{align*} \left[ \mathbf{W}_o \mathbf{W}_d \right] \begin{bmatrix} \boldx_o \\ \boldx_d \end{bmatrix} + \mathbf{b}, \end{align*}where matrices $\mathbf{W}_o$ and $\mathbf{W}_d$ are concatenated horizontally, and a one-hot vector $\boldx_o$ is stacked on top of a dense vector $\boldx_d$ (and $\mathbf{b}$ is a bias). Note that the above is equivalent to $\mathbf{W}_o \boldx_o + \mathbf{W}_d \boldx_d + \mathbf{b}$. Moreover, since we know that $\mathbf{W}_o \boldx_o$ is equivalent to a lookup in a LookupTable, we can do the following:
D_o, D_d, D_h = 5, 3, 2 -- width of W_o, width of W_d, height of both W_o and W_d
x_o = torch.LongTensor({2}) -- index equivalent of [0 1 0 0 0]
x_d = torch.randn(1, D_d)
-- our first example of a Table layer/container
par = nn.ParallelTable() -- takes a TABLE of inputs, applies i'th child to i'th input, and returns a table
par:add(nn.LookupTable(D_o, D_h)) -- first child
par:add(nn.Linear(D_d, D_h)) -- second child
-- this parallel table produces a table of 2 1xD_h tensors corresponding to W_o x_o and W_d x_d + b resp.
print(par:forward({x_o, x_d}))
{ 1 : DoubleTensor - size: 1x2 2 : DoubleTensor - size: 1x2 }
-- to get our full linear transformation, we need to add the two tensors in the output table.
-- as usual, to compose functions in order we use nn.Sequential
spAndDenseLinear = nn.Sequential()
spAndDenseLinear:add(par)
spAndDenseLinear:add(nn.CAddTable()) -- CAddTable sums the tensors in its input table elementwise
-- let's look at spAndDenseLinear
print(spAndDenseLinear)
print()
nn.Sequential { [input -> (1) -> (2) -> output] (1): nn.ParallelTable { input |`-> (1): nn.LookupTable |`-> (2): nn.Linear(3 -> 2) ... -> output } (2): nn.CAddTable } { gradInput : table: 0x40e50d38 modules : { 1 : nn.ParallelTable { input |`-> (1): nn.LookupTable |`-> (2): nn.Linear(3 -> 2) ... -> output } { gradInput : table: 0x40e50d38 modules : { 1 : nn.LookupTable { copiedInput : false weight : DoubleTensor - size: 5x2 shouldScaleGradByFreq : false
gradWeight : DoubleTensor - size: 5x2 gradInput : DoubleTensor - empty _count : IntTensor - empty _input : LongTensor - empty output : DoubleTensor - size: 1x2 } 2 : nn.Linear(3 -> 2) { gradBias : DoubleTensor - size: 2 weight : DoubleTensor - size: 2x3 bias : DoubleTensor - size: 2 gradInput : DoubleTensor - empty addBuffer : DoubleTensor - size: 1 gradWeight : DoubleTensor - size: 2x3 output : DoubleTensor - size: 1x2 } } output : { 1 : DoubleTensor - size: 1x2 2 : DoubleTensor - size: 1x2
} } 2 : nn.CAddTable { gradInput : table: 0x41759cc8 output : DoubleTensor - empty } } output : DoubleTensor - empty }
-- finally, let's use spAndDenseLinear to compute W_o x_o + W_d x_d + b
print(spAndDenseLinear:forward({x_o, x_d}))
1.2212 -2.0951 [torch.DoubleTensor of size 1x2]
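To confirm that this network really computes $\mathbf{W}_o \boldx_o + \mathbf{W}_d \boldx_d + \mathbf{b}$, we can redo the calculation by hand. The sketch below rebuilds the same architecture with fresh (randomly initialized) modules and assumes, as is the case in 'nn', that nn.LookupTable stores one row per index (so $\mathbf{W}_o$ is the transpose of its weight) and that nn.Linear stores its weight as outputSize x inputSize.

```lua
require 'nn'
torch.manualSeed(287)

local D_o, D_d, D_h = 5, 3, 2
local lt = nn.LookupTable(D_o, D_h)
local lin = nn.Linear(D_d, D_h)
local net = nn.Sequential()
net:add(nn.ParallelTable():add(lt):add(lin))
net:add(nn.CAddTable())

local x_o = torch.LongTensor({2})
local x_d = torch.randn(1, D_d)
local out = net:forward({x_o, x_d}) -- size 1 x D_h

-- W_o x_o with an explicit one-hot vector equals row 2 of the lookup table
local onehot = torch.zeros(D_o)
onehot[2] = 1
local sparsePart = torch.mv(lt.weight:t(), onehot)
assert((sparsePart - lt.weight[2]):abs():max() < 1e-12)

-- W_d x_d + b by hand
local densePart = torch.mv(lin.weight, x_d[1]):add(lin.bias)

local manual = sparsePart + densePart
assert((out[1] - manual):abs():max() < 1e-10)
```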
Note that table containers/layers allow networks to take tables as input and produce them as output. Here we show some more layers that are useful when dealing with table inputs.
t = {torch.randn(1, 3), torch.randn(1, 3), torch.randn(1, 2)}
-- JoinTable(dim, nDims) makes a tensor from a table of tensors (of nDims dimensions) by concatenating along dim
print(nn.JoinTable(2, 2):forward(t))
print()
-- NarrowTable(offset, len) returns a table containing the len elements of the input table starting at offset
print(nn.NarrowTable(1, 2):forward(t))
-0.6461 2.0589 -0.7707 0.0129 -0.6880 0.9008 0.0479 1.5202 [torch.DoubleTensor of size 1x8] { 1 : DoubleTensor - size: 1x3 2 : DoubleTensor - size: 1x3 }
Criterion objects are used to represent loss functions. They are similar to modules in that you can call :forward() and :backward() on them, and that they have .output and .gradInput members. The major difference is that :forward() takes 2 arguments, viz., scores/predictions and the true scores/labels.
mse = nn.MSECriterion() -- mean squared error criterion, often used for (scalar) regression loss
y = torch.randn(3) -- scalar targets
yhat = torch.zeros(3) -- scalar predictions
print(mse:forward(yhat, y)) -- returns MSE = 1/n sum_i (yhat_i - y_i)^2
print()
-- can remove 1/n factor as follows
mse.sizeAverage = false
print(mse:forward(yhat, y))
print()
-- get gradient of the loss wrt each prediction using :backward()
dLdyhat = mse:backward(yhat, y)
print(dLdyhat)
print()
-- generally will pass dLdyhat as gradOutput when calling :backward() on network that computed yhat
1.8225688302461 5.4677064907383 2.3650 -1.6119 3.6986 [torch.DoubleTensor of size 3]
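The numbers nn.MSECriterion returns can be reproduced by hand. This sketch assumes the default setting (sizeAverage = true), under which the loss is $\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$ and its gradient wrt $\hat{y}_i$ is $\frac{2}{n}(\hat{y}_i - y_i)$. The local names are illustrative.

```lua
require 'nn'
torch.manualSeed(287)

local mse = nn.MSECriterion() -- sizeAverage is true by default
local y = torch.randn(3)
local yhat = torch.zeros(3)
local n = y:size(1)

-- loss by hand: mean of squared differences
local manualLoss = (yhat - y):pow(2):mean()
local loss = mse:forward(yhat, y)
assert(math.abs(loss - manualLoss) < 1e-12)

-- gradient by hand: (2/n) * (yhat - y)
local manualGrad = (yhat - y):mul(2 / n)
local grad = mse:backward(yhat, y)
assert((grad - manualGrad):abs():max() < 1e-12)
```

With sizeAverage set to false, both the loss and the gradient lose the $1/n$ factor.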
-- here's a classification criterion you're likely to use:
nllcrit = nn.ClassNLLCriterion() -- log loss for multiclass classification; expects log-probabilities and true class
Z = torch.randn(2, 3) -- 3-class classification scores for 2 examples
Yhat = nn.LogSoftMax():forward(Z) -- make log probabilities
Y = torch.Tensor({3,1}) -- true classes
print(nllcrit:forward(Yhat, Y))
print()
print(nllcrit:backward(Yhat, Y)) -- N.B. ClassNLLCriterion (by default) divides by numExamples, which affects grads
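What nn.ClassNLLCriterion computes can likewise be written out by hand. The sketch below assumes the default averaging over examples: the loss is minus the mean log-probability assigned to each example's true class, and the gradient is $-1/N$ at each true-class position and 0 elsewhere. The local names are illustrative.

```lua
require 'nn'
torch.manualSeed(287)

local nllcrit = nn.ClassNLLCriterion()
local Z = torch.randn(2, 3)                   -- scores: 2 examples, 3 classes
local Yhat = nn.LogSoftMax():forward(Z)       -- log-probabilities
local Y = torch.Tensor({3, 1})                -- true classes

local loss = nllcrit:forward(Yhat, Y)

-- by hand: -(1/N) * sum_n Yhat[n][Y[n]]
local manualLoss = -(Yhat[1][Y[1]] + Yhat[2][Y[2]]) / 2
assert(math.abs(loss - manualLoss) < 1e-10)

-- gradient: -1/N at the true class of each example, 0 elsewhere
local grad = nllcrit:backward(Yhat, Y)
local manualGrad = torch.zeros(2, 3)
manualGrad[1][3] = -0.5
manualGrad[2][1] = -0.5
assert((grad - manualGrad):abs():max() < 1e-10)
```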
Here we'll show how to train a 1-layer neural network on a regression-style task.
-- let's generate some data
torch.manualSeed(287)
N = 5 -- num examples
F = 4 -- num features
X = torch.randn(N, F)
y = torch.mv(X, torch.randn(F)):pow(2):add(torch.randn(N))
-- let's create a 1-layer MLP
H = 3 -- hidden layer size
mlp = nn.Sequential()
mlp:add(nn.Linear(F,H))
mlp:add(nn.Tanh())
mlp:add(nn.Linear(H, 1))
-- note above equivalent to mlp = nn.Sequential():add(nn.Linear(F,H)):add(nn.Tanh()):add(nn.Linear(H,1))
-- now define our criterion
mse = nn.MSECriterion()
-- we can flatten (and then retrieve) all parameters (and gradParameters) of a module in the following way:
params, gradParams = mlp:getParameters() -- N.B. getParameters() moves around memory, and should only be called once!
eta = 0.01
-- now that we have our parameters flattened, we'll train with very simple SGD
-- note that all operations are batched across all of X
nEpochs = 5
for i = 1, nEpochs do
    -- zero out our gradients
    gradParams:zero()
    -- do forward pass
    preds = mlp:forward(X)
    -- get loss
    loss = mse:forward(preds, y)
    print("epoch " .. i .. ", loss: " .. loss)
    -- backprop
    dLdpreds = mse:backward(preds, y) -- gradients of loss wrt preds
    mlp:backward(X, dLdpreds)
    -- update params with sgd step
    params:add(-eta, gradParams)
end
Extracting parameters and gradParameters with :getParameters() is often useful, especially if you want to hand them to more sophisticated optimization algorithms (e.g., those in the 'optim' package). If you're just doing (S)GD, however, you can instead use the module function :updateParameters(). Here's the same example as above using :updateParameters():
-- let's generate some data
torch.manualSeed(287)
N = 5 -- num examples
F = 4 -- num features
X = torch.randn(N, F)
y = torch.mv(X, torch.randn(F)):pow(2):add(torch.randn(N))
-- let's create a 1-layer MLP
H = 3 -- hidden layer size
mlp = nn.Sequential()
mlp:add(nn.Linear(F,H))
mlp:add(nn.Tanh())
mlp:add(nn.Linear(H, 1))
-- note above equivalent to mlp = nn.Sequential():add(nn.Linear(F,H)):add(nn.Tanh()):add(nn.Linear(H,1))
-- now define our criterion
mse = nn.MSECriterion()
eta = 0.01
-- this time we don't flatten the parameters; we'll train with very simple SGD via :updateParameters()
-- note that all operations are batched across all of X
nEpochs = 5
for i = 1, nEpochs do
    -- zero out our gradients
    mlp:zeroGradParameters()
    -- do forward pass
    preds = mlp:forward(X)
    -- get loss
    loss = mse:forward(preds, y)
    print("epoch " .. i .. ", loss: " .. loss)
    -- backprop
    dLdpreds = mse:backward(preds, y) -- gradients of loss wrt preds
    mlp:backward(X, dLdpreds)
    -- update params with sgd step
    mlp:updateParameters(eta) -- computes parameters = parameters - eta * gradient
end
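For completeness, here is a sketch of the :getParameters() route combined with the 'optim' package mentioned above, using optim.sgd. It assumes the 'optim' package is installed; the closure name feval and the config table are illustrative choices, and the targets are reshaped to N x 1 so that they have the same shape as the predictions.

```lua
require 'nn'
require 'optim'
torch.manualSeed(287)

local N, F, H = 5, 4, 3
local X = torch.randn(N, F)
local y = torch.mv(X, torch.randn(F)):pow(2):add(torch.randn(N)):view(N, 1)

local mlp = nn.Sequential():add(nn.Linear(F, H)):add(nn.Tanh()):add(nn.Linear(H, 1))
local mse = nn.MSECriterion()
local params, gradParams = mlp:getParameters()

-- optim routines expect a closure returning the loss and dLoss/dParams
local function feval(p)
    if p ~= params then params:copy(p) end
    gradParams:zero()
    local preds = mlp:forward(X)
    local loss = mse:forward(preds, y)
    mlp:backward(X, mse:backward(preds, y))
    return loss, gradParams
end

local config = {learningRate = 0.01}
local losses = {}
for i = 1, 5 do
    -- optim.sgd updates params in place and returns them plus a table {loss}
    local _, fs = optim.sgd(feval, params, config)
    losses[i] = fs[1]
    print("epoch " .. i .. ", loss: " .. fs[1])
end
```

Swapping optim.sgd for another optimizer in 'optim' generally only requires changing the function name and the config table.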
If your code doesn't seem to be working, here are some things to check:
It is also worth noting that if you're doing something vaguely complicated/non-standard with 'nn', you should always check your gradients (with finite differences)!
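A finite-difference gradient check can be sketched as follows. This is one simple way to do it, not the only one: it perturbs each flattened parameter by a small eps, estimates the derivative with a central difference, and compares against the gradient from :backward(). The helper name lossAt, the value of eps, and the tolerance are all illustrative assumptions.

```lua
require 'nn'
torch.manualSeed(287)

local mlp = nn.Sequential():add(nn.Linear(4, 3)):add(nn.Tanh()):add(nn.Linear(3, 1))
local crit = nn.MSECriterion()
local X = torch.randn(5, 4)
local y = torch.randn(5, 1)
local params, gradParams = mlp:getParameters()

-- loss at the current value of params
local function lossAt()
    return crit:forward(mlp:forward(X), y)
end

-- analytic gradient from backprop
gradParams:zero()
lossAt()
mlp:backward(X, crit:backward(mlp.output, y))
local analytic = gradParams:clone()

-- numerical gradient via central differences
local eps = 1e-5
local numeric = torch.zeros(params:size(1))
for i = 1, params:size(1) do
    local orig = params[i]
    params[i] = orig + eps
    local lossPlus = lossAt()
    params[i] = orig - eps
    local lossMinus = lossAt()
    params[i] = orig -- restore
    numeric[i] = (lossPlus - lossMinus) / (2 * eps)
end

local maxDiff = (analytic - numeric):abs():max()
print("max abs difference: " .. maxDiff)
assert(maxDiff < 1e-6)
```

The check is slow (two forward passes per parameter), so it is best run on a small network and a small batch.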