In [2]:
require 'nn'
torch.manualSeed(287)


# 1. Preliminaries

## What does 'nn' buy us?

• Lets us declaratively specify neural network architectures that compute their forward and backward passes automatically
• This is huge! Now we don't need to hand-code gradients; we can just define our model and optimize

## Networks are specified with (essentially) two kinds of abstract objects: Modules and Criteria

• Modules (recursively) define a transformation from an input to an output; can think of them as functions
• A Criterion calculates a loss based on an input (typically computed by a module) and a target

# 2. Modules


We will use nn.CMul as our first example of a module.

f = nn.CMul(size)


creates a module that computes the function $f: \reals^{size} \rightarrow \reals^{size}$ defined by $f(\boldx) = \boldx \odot \boldw$, where $\odot$ is elementwise multiplication. $\boldw$ are the function's parameters, which 'nn' will automatically initialize to something reasonable.

## 2a. The Forward Pass

In [2]:
x = torch.range(1, 5) -- will input this into the module

f = nn.CMul(x:size()) -- create the module
-- let's see what f's parameters were initialized to. ('nn' always inits to something reasonable)
print(f.weight) -- N.B. The ability to introspect each 'nn' module is what makes it often more convenient than Theano
print()

-- to apply f to an input x we call f:forward(x)
print(f:forward(x))

Out[2]:
-0.3180
-0.0484
0.1885
0.4165
0.3364
[torch.DoubleTensor of size 5]

-0.3180
-0.0968
0.5655
1.6661
1.6822
[torch.DoubleTensor of size 5]


In [3]:
-- modules are stateful; they store their parameters (if any) and also their last output
print(f.output) -- N.B. every module has an 'output' member

Out[3]:
-0.3180
-0.0968
0.5655
1.6661
1.6822
[torch.DoubleTensor of size 5]



Let's create another simple module.

g = nn.Sum(j)


creates a module computing the function $g: \reals^{D_1 \times \ldots \times \, D_j \, \times \ldots \times \, D_M} \rightarrow \reals^{D_1 \times \ldots \times \, D_{j-1} \, \times \, D_{j+1} \, \times \ldots \times \, D_M}$ that sums the input over dimension $j$ (thus decreasing the number of dimensions by 1).

In [4]:
g = nn.Sum(1) -- sum over dimension 1
print(g:forward(x))

Out[4]:
 15
[torch.DoubleTensor of size 1]



## 2b. Batching

Most modules allow batching of inputs along the first dimension. That is, if your module expects inputs $x \in \reals^{size}$, you can give it an input $X \in \reals^{N \times size}$, and it will apply itself to each $x$ along the first dimension of $X$.

In [5]:
-- let's batch calls to f
X = x:view(1,5):expand(3, 5) -- here, N = 3
print(f:forward(X))

-- whenever you can, you should batch; it'll be much faster

Out[5]:
-0.3180 -0.0968  0.5655  1.6661  1.6822
-0.3180 -0.0968  0.5655  1.6661  1.6822
-0.3180 -0.0968  0.5655  1.6661  1.6822
[torch.DoubleTensor of size 3x5]



## 2c. Container Modules

Individual modules can be combined using 'Container' modules to compute more complicated functions. For instance, the modules g and f can be composed to get g(f()) using

nn.Sequential()

In [6]:
h = nn.Sequential() -- this module computes the function defined by composing its child modules' functions in order
h:add(f)
h:add(g)
print(h:forward(x)) -- computes g(f(x)) = sum_i [x \odot w]_i, where \odot is elementwise multiplication

Out[6]:
 3.4990
[torch.DoubleTensor of size 1]


In [7]:
-- though nn.Sequential is the container you'll use most (at least early in the course), there are others.
-- nn.Concat(j) is a container that computes the function defined by applying each of its child modules to a single
-- input, and then concatenating the respective outputs along dimension j
cat = nn.Concat(1) -- concatenate outputs along 1st dimension
cat:add(f)
cat:add(g)
print(cat:forward(x))

Out[7]:
 -0.3180
-0.0968
0.5655
1.6661
1.6822
15.0000
[torch.DoubleTensor of size 6]


In [8]:
-- You can print a module to see its contents
print(h)

Out[8]:
nn.Sequential {
[input -> (1) -> (2) -> output]
(1): nn.CMul
(2): nn.Sum
}
{
modules :
{
1 :
nn.CMul
{
output : DoubleTensor - size: 5
_output : DoubleTensor - size: 5
_repeat : DoubleTensor - empty
_expand : DoubleTensor - size: 3x5
gradWeight : DoubleTensor - size: 5
_weight : DoubleTensor - size: 5
size : LongStorage - size: 1
weight : DoubleTensor - size: 5
}
2 :
nn.Sum
{
dimension : 1
output : DoubleTensor - size: 1
}
}
output : DoubleTensor - size: 1
}


In [9]:
-- to access the children of containers you can use :get(i) or index into a list of children returned by .modules
print(h:get(1))
print(h.modules[1])

Out[9]:
nn.CMul
{
output : DoubleTensor - size: 5
_output : DoubleTensor - size: 5
_repeat : DoubleTensor - empty
_expand : DoubleTensor - size: 3x5
gradWeight : DoubleTensor - size: 5
_weight : DoubleTensor - size: 5
size : LongStorage - size: 1
weight : DoubleTensor - size: 5
}
nn.CMul
{
output : DoubleTensor - size: 5
_output : DoubleTensor - size: 5
_repeat : DoubleTensor - empty
_expand : DoubleTensor - size: 3x5
gradWeight : DoubleTensor - size: 5
_weight : DoubleTensor - size: 5
size : LongStorage - size: 1
weight : DoubleTensor - size: 5
}


## 2d. The Backward Pass


Suppose we have a module computing a function $h$ that participates in the definition of a loss function $L$. For a particular input $\boldx \in \reals^n$ let $\boldz \in \reals^m$ be defined by $\boldz = h(\boldx)$, which allows us to write our loss function as $L(\boldz)$. By the (multivariate) chain rule, the gradient of $L$ wrt $x_i$ is

\begin{align*} \frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial z_j} \frac{\partial z_j}{\partial x_i} \end{align*}

Assuming $L$ returns a scalar, we can rewrite the above for the entire $\boldx$ as

\begin{align*} \frac{\partial L}{\partial \boldx} = \left(\frac{\partial L}{\partial \boldz}\right)^T \frac{\partial \boldz}{\partial \boldx}, \end{align*}

where $\frac{\partial \boldz}{\partial \boldx}$ is the Jacobian, which lives in $\reals^{m \times n}$.

Each 'nn' module knows how to (implicitly) compute $\frac{\partial \boldz}{\partial \boldx}$ -- the gradient of its output wrt its input -- and so can compute $\frac{\partial L}{\partial \boldx}$ if it is also handed $\frac{\partial L}{\partial \boldz}$. In just the same way, if a module has parameters $\btheta$, it knows how to calculate $\frac{\partial \boldz}{\partial \btheta}$, and can therefore calculate $\frac{\partial L}{\partial \btheta}$ if it is handed $\frac{\partial L}{\partial \boldz}$.

It's very important to know the 'nn' terminology for these gradients:

• $\frac{\partial L}{\partial \boldx}$ is called 'gradInput' in nn; it's the gradient of the loss wrt a module's input

• $\frac{\partial L}{\partial \boldz}$ is called 'gradOutput' in nn; it's the gradient of the loss wrt a module's output

• $\frac{\partial L}{\partial \btheta}$ is called either 'gradWeight' or 'gradBias' in nn; it's the gradient of the loss wrt a module's parameters

Given $\frac{\partial L}{\partial \boldz}$ (i.e., gradOutput), an 'nn' module computes $\frac{\partial L}{\partial \boldx}$ with the :backward() function, and stores it in its 'gradInput' member, as follows:

In [10]:
gradOut = torch.randn(1) -- let's make up a random gradOutput for dL/dz (of same dimension as output of h)
print(gradOut)

-- let's now compute dL/dx with our gradOut
-- N.B. you MUST call :forward() before :backward() (and provide the same input); note we called :forward() above
print(h:backward(x, gradOut)) -- returns dL/dx, which is also stored in h.gradInput

Out[10]:
 1.6834
[torch.DoubleTensor of size 1]

-0.5353
-0.0815
0.3173
0.7012
0.5664
[torch.DoubleTensor of size 5]



Let's check the gradients :backward() computed. Recall $h = g(f(\boldx))$ = nn.Sequential():add(f):add(g). Because $h$ is an nn.Sequential/composition, it should first get gradient of $g$ wrt its input, which is $f(\boldx)$. So, we get \begin{align*} \frac{\partial L}{\partial f(\boldx)} = \frac{\partial L}{\partial h(\boldx)} \cdot \frac{\partial h(\boldx)}{\partial f(\boldx)} = gradOut \cdot g'(f(\boldx)) \end{align*}

Since $g$ just sums, $g'(\boldx)_i = 1$ for each $i$, and so $\frac{\partial L}{\partial f(\boldx)} = gradOut \cdot$ torch.ones(x:size())

In [11]:
print(g.gradInput)

Out[11]:
 1.6834
1.6834
1.6834
1.6834
1.6834
[torch.DoubleTensor of size 5]



Now that we have $\frac{\partial L}{\partial f(\boldx)}$, we can calculate $\frac{\partial L}{\partial \boldx}$ as $(\frac{\partial L}{\partial f(\boldx)})^T \frac{\partial f}{\partial \boldx}$. Since $f(\boldx)_i = w_i \cdot x_i$, we have that $\frac{\partial f_i}{\partial x_i} = w_i$ and $\frac{\partial f_i}{\partial x_j} = 0$ for $j \neq i$, so $\frac{\partial f}{\partial \boldx} = diag(\boldw)$. Thus, $\frac{\partial L}{\partial \boldx}$ is g.gradInput$^T diag(\boldw) =$ g.gradInput $\odot \boldw$.

In [12]:
assert((f.gradInput - torch.cmul(g.gradInput, f.weight)):abs():max() < 1e-10)



In addition to computing $\frac{\partial L}{\partial \boldx}$, backward() also computes $\frac{\partial L}{\partial \btheta}$, where $\btheta$ are the module's parameters. Specifically, modules accumulate the gradients wrt their parameters in their 'gradWeight' and 'gradBias' members. So, let's redo the above example, this time paying attention to parameters.

In [13]:
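-- since backward() accumulates (i.e., adds) gradients, we need to start by zeroing out gradWeight and gradBias
h:zeroGradParameters()
h:backward(x, gradOut)
print(f.gradWeight)

-- let's check that gradient was correct, using a calculation similar to the one used above for dL/dx
assert((f.gradWeight - torch.cmul(g.gradInput, x)):abs():max() < 1e-10)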

Out[13]:
 1.6834
3.3668
5.0503
6.7337
8.4171
[torch.DoubleTensor of size 5]
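
To make the accumulation behavior concrete, here is a minimal sketch (the fresh module and the input values are made up for illustration) showing that calling :backward() twice without zeroing in between doubles 'gradWeight':

```lua
require 'nn'

local f2 = nn.CMul(3)                 -- fresh module, separate from f above
local x2 = torch.Tensor({1, 2, 3})
local gradOut2 = torch.ones(3)

f2:forward(x2)
f2:zeroGradParameters()
f2:backward(x2, gradOut2)
local once = f2.gradWeight:clone()    -- dL/dw = x .* gradOutput = {1, 2, 3}

f2:backward(x2, gradOut2)             -- no zeroing in between, so the gradient is added again
print(once)
print(f2.gradWeight)                  -- twice the values in 'once'
```

This is why a training loop has to zero the gradient buffers before each new backward pass.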



## 2e. Module Internals

Now that we know about :forward() and :backward() let's get a more precise sense of how they work. You'll need to know this if you ever want to implement your own modules!

In [ ]:
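-- Recall that Module is an abstract class. The (abstract) functions :forward() and :backward() are defined in terms
-- of three other functions: updateOutput, updateGradInput, and accGradParameters.

-- The below code is from https://github.com/torch/nn/blob/master/Module.lua; the comments are mine
function Module:forward(input)
   return self:updateOutput(input) -- subclasses must implement updateOutput, which sets self.output
end

function Module:backward(input, gradOutput, scale)
   scale = scale or 1
   self:updateGradInput(input, gradOutput) -- sets self.gradInput (dL/dx)
   self:accGradParameters(input, gradOutput, scale) -- adds scale * dL/dtheta into gradWeight/gradBias
   return self.gradInput
end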

In [ ]:
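-- here are some very simplified versions of these 3 functions for the CMul module (with new comments),
-- assuming a single (non-batched) input

-- N.B. CMul inherits from module, and so has .output, and .gradInput members;
-- because it has parameters, it also has .weight and .gradWeight members

function CMul:updateOutput(input)
   self.output:resizeAs(input):copy(input) -- self.output = input
   self.output:cmul(self.weight)           -- self.output = self.output .* self.weight
   return self.output
end

function CMul:updateGradInput(input, gradOutput)
   self.gradInput:resizeAs(input):zero()
   self.gradInput:addcmul(1, self.weight, gradOutput) -- self.gradInput = self.weight .* gradOutput
   return self.gradInput
end

function CMul:accGradParameters(input, gradOutput, scale)
   scale = scale or 1
   -- don't zero out gradWeight, because we're accumulating!
   self.gradWeight:addcmul(scale, input, gradOutput) -- self.gradWeight += scale * (input .* gradOutput)
end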
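
As a sketch of what implementing your own module can look like, the following defines a small parameter-free module that squares its input elementwise. The name 'nn.MySquare' is hypothetical (chosen to avoid clashing with existing modules); only updateOutput and updateGradInput are overridden, since the default accGradParameters does nothing for modules without parameters.

```lua
require 'nn'

-- hypothetical module: z = x .* x (elementwise); it has no parameters
local MySquare, parent = torch.class('nn.MySquare', 'nn.Module')

function MySquare:updateOutput(input)
   self.output:resizeAs(input):copy(input):cmul(input)                 -- z_i = x_i^2
   return self.output
end

function MySquare:updateGradInput(input, gradOutput)
   -- dL/dx_i = dL/dz_i * dz_i/dx_i = gradOutput_i * 2 * x_i
   self.gradInput:resizeAs(input):copy(input):mul(2):cmul(gradOutput)
   return self.gradInput
end

local sq = nn.MySquare()
local xs = torch.Tensor({1, -2, 3})
local out = sq:forward(xs)                         -- {1, 4, 9}
local gradIn = sq:backward(xs, torch.ones(3))      -- {2, -4, 6}
print(out)
print(gradIn)
```

Because :forward() and :backward() are inherited from Module, the new module can be dropped into an nn.Sequential like any built-in one.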


## 2f. Some Useful Modules with Parameters

In [14]:
-- In addition to nn.CMul, you will likely want to know about
lin = nn.Linear(x:size(1), 3) -- computes Wx + b, where W \in R^{3 x 5} and b \in R^3
print(lin:forward(x))

Out[14]:
-2.8031
1.9383
-1.8525
[torch.DoubleTensor of size 3]


In [15]:
-- LookupTables will be extremely important for this course; they map indices to corresponding weight vectors
LT = nn.LookupTable(5, 3) -- maps indices (1 thru 5) to corresponding weight vectors, which live in R^3

-- let's look at a LookupTable's weights
print(LT.weight)

-- LookupTables take indices as input!
idxs = torch.LongTensor({1,2,5})
print(LT:forward(idxs)) -- extracts 1st, 2nd, and 5th rows of weights

-- can also batch input to a LookupTable, as follows
batchIdxs = torch.LongTensor({{1, 3}, {4, 5}, {2, 3}}) -- here, there are 3 examples, each associated with 2 idxs
print(LT:forward(batchIdxs))

Out[15]:
-0.3408  0.8594  1.2139
0.1566 -0.5897  0.1788
-0.0315  0.1821 -0.7354
0.1246  0.2621 -0.0320
1.1030 -0.9798  0.3587
[torch.DoubleTensor of size 5x3]

-0.3408  0.8594  1.2139
0.1566 -0.5897  0.1788
1.1030 -0.9798  0.3587
[torch.DoubleTensor of size 3x3]

(1,.,.) =
-0.3408  0.8594  1.2139
-0.0315  0.1821 -0.7354

(2,.,.) =
0.1246  0.2621 -0.0320
1.1030 -0.9798  0.3587

(3,.,.) =
0.1566 -0.5897  0.1788
-0.0315  0.1821 -0.7354
[torch.DoubleTensor of size 3x2x3]


In [16]:
-- nn.Add computes x + b, where b \in R^5 (though it can also be used to add a single constant)
add = nn.Add(x:size(1))
print(add:forward(x))

-- there are many more (esp. convolutions, which we'll talk about later in the course)!

Out[16]:
 1.4369
2.2328
2.5575
4.2737
4.6302
[torch.DoubleTensor of size 5]



## 2g. Some Useful Modules Without Parameters

In [17]:
-- non-linearities/'transfer' functions
x = torch.randn(5)
nonlin1 = nn.Sigmoid()
nonlin2 = nn.LogSoftMax()
nonlin3 = nn.Tanh()
nonlin4 = nn.ReLU()

print(nonlin1:forward(x))
print(nonlin2:forward(x))
print(nonlin3:forward(x))
print(nonlin4:forward(x))

-- other mathematical operations
X = torch.randn(3, 2)
op1 = nn.Max(1, 2) -- maxes over dimension 1, expects 2d input
op2 = nn.Mean(2, 2) -- means over dimension 2, expects 2d input
op3 = nn.Abs()

print(op1:forward(X))
print(op2:forward(X))
print(op3:forward(X))

-- there are also Modules that reshape or re-view their arguments; the one you'll use most often is nn.View,
-- which takes in the desired dimension sizes
print(nn.View(2,3):forward(X))

-- there are many more!

Out[17]:
 0.6790
0.5264
0.6779
0.6240
0.5938
[torch.DoubleTensor of size 5]

-1.3853
-2.0289
-1.3905
-1.6280
-1.7545
[torch.DoubleTensor of size 5]

0.6346
0.1051
0.6315
0.4672
0.3626
[torch.DoubleTensor of size 5]

0.7491
0.1055
0.7440
0.5065
0.3799
[torch.DoubleTensor of size 5]

0.9011
2.5226
[torch.DoubleTensor of size 2]

0.4434
0.2421
1.2315
[torch.DoubleTensor of size 3]

0.9011  0.0143
0.8167  1.3009
0.0597  2.5226
[torch.DoubleTensor of size 3x2]

 0.9011 -0.0143 -0.8167
1.3009 -0.0597  2.5226
[torch.DoubleTensor of size 2x3]



All the containers (and other modules) we've seen so far take in single Tensors as arguments. This won't be sufficient if we want functions of multiple inputs (especially if they're of different sizes or types).

As a motivating example, suppose we want to make a Linear-like layer over both sparse and dense features. That is, we want to compute

\begin{align*} \left[ \mathbf{W}_o \mathbf{W}_d \right] \begin{bmatrix} \boldx_o \\ \boldx_d \end{bmatrix} + \mathbf{b}, \end{align*}

where matrices $\mathbf{W}_o$ and $\mathbf{W}_d$ are concatenated horizontally, and a one-hot vector $\boldx_o$ is stacked on top of a dense vector $\boldx_d$ (and $\mathbf{b}$ is a bias). Note that the above is equivalent to $\mathbf{W}_o \boldx_o + \mathbf{W}_d \boldx_d + \mathbf{b}$. Moreover, since we know that $\mathbf{W}_o \boldx_o$ is equivalent to a lookup in a LookupTable, we can do the following:

In [18]:
D_o, D_d, D_h = 5, 3, 2 -- width of W_o, width of W_d, height of both W_o and W_d
x_o = torch.LongTensor({2}) -- index equivalent of [0 1 0 0 0]
x_d = torch.randn(1, D_d)

-- our first example of a Table layer/container
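par = nn.ParallelTable() -- takes a TABLE of inputs, applies i'th child to i'th input, and returns a table
par:add(nn.LookupTable(D_o, D_h)) -- first child: W_o x_o, computed as a lookup
par:add(nn.Linear(D_d, D_h)) -- second child: W_d x_d + b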

-- this parallel table produces a table of 2 1xD_h tensors corresponding to W_o x_o and W_d x_d + b resp.
print(par:forward({x_o, x_d}))

Out[18]:
{
1 : DoubleTensor - size: 1x2
2 : DoubleTensor - size: 1x2
}

In [19]:
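-- to get our full linear transformation, we need to add the two tensors in the table.
-- as usual, to compose functions in order we use nn.Sequential
spAndDenseLinear = nn.Sequential()
spAndDenseLinear:add(par)
spAndDenseLinear:add(nn.CAddTable()) -- CAddTable sums the tensors in its input table elementwise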

-- let's look at spAndDenseLinear
print(spAndDenseLinear)
print()

Out[19]:
nn.Sequential {
[input -> (1) -> (2) -> output]
(1): nn.ParallelTable {
input
|-> (1): nn.LookupTable
|-> (2): nn.Linear(3 -> 2)
... -> output
}
}
{
modules :
{
1 :
nn.ParallelTable {
input
|-> (1): nn.LookupTable
|-> (2): nn.Linear(3 -> 2)
... -> output
}
{
modules :
{
1 :
nn.LookupTable
{
copiedInput : false
weight : DoubleTensor - size: 5x2
gradWeight : DoubleTensor - size: 5x2
_count : IntTensor - empty
_input : LongTensor - empty
output : DoubleTensor - size: 1x2
}
2 :
nn.Linear(3 -> 2)
{
gradBias : DoubleTensor - size: 2
weight : DoubleTensor - size: 2x3
bias : DoubleTensor - size: 2
addBuffer : DoubleTensor - size: 1
gradWeight : DoubleTensor - size: 2x3
output : DoubleTensor - size: 1x2
}
}
output :
{
1 : DoubleTensor - size: 1x2
2 : DoubleTensor - size: 1x2
}
}
2 :
{
output : DoubleTensor - empty
}
}
output : DoubleTensor - empty
}


In [20]:
-- finally, let's use spAndDenseLinear to compute W_o x_o + W_d x_d + b
print(spAndDenseLinear:forward({x_o, x_d}))

Out[20]:
 1.2212 -2.0951
[torch.DoubleTensor of size 1x2]
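
The claim that $\mathbf{W}_o \boldx_o$ is just a lookup can be checked directly. The sketch below uses a fresh LookupTable (not the one inside 'par') and assumes, as in 'nn', that its weight matrix stores one embedding per row, so that $\mathbf{W}_o$ corresponds to the transpose of the weight:

```lua
require 'nn'
torch.manualSeed(287)

local lt = nn.LookupTable(5, 2)               -- weight is 5x2: one row per index
local oneHot = torch.zeros(5)
oneHot[2] = 1                                 -- [0 1 0 0 0]

local viaLookup = lt:forward(torch.LongTensor({2}))[1]  -- 2nd row of the weights
local viaMatmul = lt.weight:t() * oneHot                -- W_o x_o, with W_o = weight^T in R^{2 x 5}

print((viaLookup - viaMatmul):abs():max())    -- the two results agree
```

The lookup avoids ever materializing the one-hot vector, which matters when the number of sparse features is large.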



Note that table containers/layers allow networks to take tables as input and produce them as output. Here we show some more layers that are useful when dealing with table inputs.

In [21]:
t = {torch.randn(1, 3), torch.randn(1, 3), torch.randn(1, 2)}

-- JoinTable(dim, nDims) makes a tensor from a table of tensors (of nDims dimensions) by concatenating along dim
print(nn.JoinTable(2, 2):forward(t))
print()
-- NarrowTable(offset, len) returns a table containing len elements of the input table, starting at offset
print(nn.NarrowTable(1, 2):forward(t))

Out[21]:
-0.6461  2.0589 -0.7707  0.0129 -0.6880  0.9008  0.0479  1.5202
[torch.DoubleTensor of size 1x8]

{
1 : DoubleTensor - size: 1x3
2 : DoubleTensor - size: 1x3
}


# 3. Criteria

Criterion objects are used to represent loss functions. They are similar to modules in that you can call :forward() and :backward() on them, and that they have .output and .gradInput members. The major difference is that :forward() takes 2 arguments, viz., scores/predictions and the true scores/labels.

In [22]:
mse = nn.MSECriterion() -- mean squared error criterion, often used for (scalar) regression loss
y = torch.randn(3) -- scalar targets
yhat = torch.zeros(3) -- scalar predictions
print(mse:forward(yhat, y)) -- returns MSE = 1/n sum_i (yhat_i - y_i)^2
print()

-- can remove 1/n factor as follows
mse.sizeAverage = false
print(mse:forward(yhat, y))
print()

-- get gradient of per-example Losses wrt predictions using :backward()
dLdyhat = mse:backward(yhat, y)
print(dLdyhat)
print()

-- generally will pass dLdyhat as gradOutput when calling :backward() on network that computed yhat

Out[22]:
1.8225688302461

5.4677064907383

2.3650
-1.6119
3.6986
[torch.DoubleTensor of size 3]


In [ ]:
-- here's a classification criterion you're likely to use:
nllcrit = nn.ClassNLLCriterion() -- log loss for multiclass classification; expects log-probabilities and true class

Z = torch.randn(2, 3) -- 3-class classification scores for 2 examples
Yhat = nn.LogSoftMax():forward(Z) -- make log probabilities
Y = torch.Tensor({3,1}) -- true classes

print(nllcrit:forward(Yhat, Y))
print()
print(nllcrit:backward(Yhat, Y)) -- N.B. ClassNLLCriterion (by default) divides by numExamples, which affects grads
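
As a sanity check on what ClassNLLCriterion computes, the sketch below (with its own made-up scores) compares its output against the negative log-probability of the true class averaged over examples, assuming the default averaging behavior noted above:

```lua
require 'nn'
torch.manualSeed(287)

local scores = torch.randn(2, 3)                     -- 2 examples, 3 classes
local logProbs = nn.LogSoftMax():forward(scores)
local classes = torch.Tensor({3, 1})                 -- true classes

local nll = nn.ClassNLLCriterion():forward(logProbs, classes)
-- by hand: -(log p(class 3 | ex. 1) + log p(class 1 | ex. 2)) / numExamples
local manual = -(logProbs[1][3] + logProbs[2][1]) / 2

print(nll, manual)
```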


# 4. Training

Here we'll show how to train a 1-layer neural network on a regression-style task.

In [3]:
-- let's generate some data
torch.manualSeed(287)
N = 5 -- num examples
F = 4 -- num features
X = torch.randn(N, F)
y = torch.randn(N, 1) -- regression targets, one per example (assumed random here)

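-- let's create a 1-layer MLP
H = 3 -- hidden layer size
mlp = nn.Sequential()
mlp:add(nn.Linear(F, H))
mlp:add(nn.Tanh()) -- the choice of nonlinearity here is an assumption
mlp:add(nn.Linear(H, 1))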

-- now define our criterion
mse = nn.MSECriterion()

-- we can flatten (and then retrieve) all parameters (and gradParameters) of a module in the following way:
params, gradParams = mlp:getParameters() -- N.B. getParameters() moves around memory, and should only be called once!
eta = 0.01

-- now that we have our parameters flattened, we'll train with very simple SGD
-- note that all operations are batched across all of X
nEpochs = 5
for i = 1, nEpochs do
-- do forward pass
preds = mlp:forward(X)
-- get loss
loss = mse:forward(preds, y)
print("epoch " .. i .. ", loss: " .. loss)
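-- backprop (zero the accumulated gradients first)
mlp:zeroGradParameters()
dLdpreds = mse:backward(preds, y) -- gradients of loss wrt preds
mlp:backward(X, dLdpreds)
-- update params with sgd step
params:add(-eta, gradParams) -- params = params - eta * gradParams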
end

Out[3]:
epoch 1, loss: 1.4847964912907
epoch 2, loss: 1.4471538022835
epoch 3, loss: 1.4123791149845
epoch 4, loss: 1.3801837665443
epoch 5, loss: 1.3503120157518


While extracting parameters and gradParameters with :getParameters() can often be useful, especially if you want to hand them to more sophisticated optimization algorithms (e.g., in the 'optim' package), if you're just doing (S)GD you can also use the module function :updateParameters(). Here's the same example as above using :updateParameters()

In [4]:
-- let's generate some data
torch.manualSeed(287)
N = 5 -- num examples
F = 4 -- num features
X = torch.randn(N, F)
y = torch.randn(N, 1) -- regression targets, one per example (assumed random here)

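-- let's create a 1-layer MLP
H = 3 -- hidden layer size
mlp = nn.Sequential()
mlp:add(nn.Linear(F, H))
mlp:add(nn.Tanh()) -- the choice of nonlinearity here is an assumption
mlp:add(nn.Linear(H, 1))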

-- now define our criterion
mse = nn.MSECriterion()

eta = 0.01

-- we'll again train with very simple SGD, this time letting the module update its own parameters
-- note that all operations are batched across all of X
nEpochs = 5
for i = 1, nEpochs do
-- do forward pass
preds = mlp:forward(X)
-- get loss
loss = mse:forward(preds, y)
print("epoch " .. i .. ", loss: " .. loss)
-- backprop (zero the accumulated gradients first)
mlp:zeroGradParameters()
dLdpreds = mse:backward(preds, y) -- gradients of loss wrt preds
mlp:backward(X, dLdpreds)
-- update params with sgd step
mlp:updateParameters(eta) -- computes parameters = parameters - eta * gradient
end

Out[4]:
epoch 1, loss: 1.4847964912907
epoch 2, loss: 1.4471538022835
epoch 3, loss: 1.4123791149845
epoch 4, loss: 1.3801837665443
epoch 5, loss: 1.3503120157518
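
For completeness, here is a sketch of the same kind of loop driven by the 'optim' package mentioned above. It assumes 'optim' is installed; the data, the targets, and the Linear/Tanh/Linear architecture are made up for illustration. optim.sgd expects a closure that returns the loss and the flattened gradient at the given parameters:

```lua
require 'nn'
require 'optim'
torch.manualSeed(287)

local Xo, yo = torch.randn(5, 4), torch.randn(5, 1)   -- illustrative data and targets
local net = nn.Sequential():add(nn.Linear(4, 3)):add(nn.Tanh()):add(nn.Linear(3, 1))
local crit = nn.MSECriterion()
local params, gradParams = net:getParameters()        -- call this only once

-- closure evaluated by the optimizer: returns loss and dL/dparams
local function feval(p)
   if p ~= params then params:copy(p) end
   gradParams:zero()                                  -- gradients accumulate, so reset them
   local preds = net:forward(Xo)
   local loss = crit:forward(preds, yo)
   net:backward(Xo, crit:backward(preds, yo))
   return loss, gradParams
end

local config = {learningRate = 0.01}
local losses = {}
for i = 1, 5 do
   local _, fs = optim.sgd(feval, params, config)     -- fs[1] is the loss before the step
   losses[i] = fs[1]
   print("epoch " .. i .. ", loss: " .. losses[i])
end
```

Other optimizers in 'optim' follow the same closure-based calling convention, so the closure can typically be reused with a different config table.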


# 5. Common Bugs

If your code doesn't seem to be working, here are some things to check:

• Gradient Tensors are zeroed out before each epoch
• Batching happens only along first dimension

It's also worth noting that if you're doing something vaguely complicated/non-standard with 'nn', you should always check your gradients (with finite differences)!
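
A finite-difference check can be sketched as follows. The helper name checkGradInput is hypothetical, it assumes a 1-dimensional input, and it only checks the gradient wrt the input (checking parameter gradients works the same way on the flattened parameters):

```lua
require 'nn'

-- compares the analytic dL/dx from :backward() with central finite differences
local function checkGradInput(module, crit, x, target, eps)
   eps = eps or 1e-5
   -- analytic gradient
   local out = module:forward(x)
   crit:forward(out, target)
   local analytic = module:backward(x, crit:backward(out, target)):clone()
   -- numerical gradient, one coordinate at a time
   local numeric = x:clone():zero()
   for i = 1, x:nElement() do
      local orig = x[i]
      x[i] = orig + eps
      local lossPlus = crit:forward(module:forward(x), target)
      x[i] = orig - eps
      local lossMinus = crit:forward(module:forward(x), target)
      x[i] = orig                                     -- restore the input
      numeric[i] = (lossPlus - lossMinus) / (2 * eps)
   end
   return (analytic - numeric):abs():max()
end

torch.manualSeed(287)
local net = nn.Sequential():add(nn.Linear(4, 3)):add(nn.Tanh())
local err = checkGradInput(net, nn.MSECriterion(), torch.randn(4), torch.randn(3))
print(err) -- should be tiny if the gradients are right
```

A large discrepancy usually points to a bug in updateGradInput of one of the modules involved.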

# 6. Exercise: Implement a Multi-class Hinge Loss Criterion

(Note that the MultiMarginCriterion in 'nn' isn't quite what we saw in class).

Recall

\begin{align*} L_{hinge}(\hat{\mathbf{y}}, \mathbf{y}) = \max\{0, 1 - (\hat{y}_c - \hat{y}_{c'})\}, \end{align*}

where $c$ is the true class and $c' = \mathrm{argmax}_{i \in \mathcal{C} - \{c\}} \hat{y}_i$ is the highest-scoring incorrect class.
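
As a starting point for the exercise, the loss value for a single example can be computed as below. This is only a sketch of the forward computation under the definition above (the function name hingeLoss is hypothetical); wrapping it in a Criterion with a batched :updateOutput() and a matching :updateGradInput() is left as the exercise.

```lua
require 'torch'

-- hinge loss for one example; yhat is a 1D score vector, c the true class index
local function hingeLoss(yhat, c)
   local others = yhat:clone()
   others[c] = -math.huge                 -- exclude the true class from the argmax
   local best = others:max()              -- score of c', the best incorrect class
   return math.max(0, 1 - (yhat[c] - best))
end

local violated = hingeLoss(torch.Tensor({2.0, 0.5, 1.5}), 1)   -- 1 - (2.0 - 1.5) = 0.5
local satisfied = hingeLoss(torch.Tensor({3.0, 0.5, 1.5}), 1)  -- margin of 1.5 >= 1, so loss is 0
print(violated, satisfied)
```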