In [2]:
require 'nn'
torch.manualSeed(287)

1. Preliminaries

What does 'nn' buy us?

  • Lets us declaratively specify neural network architectures whose forward and backward passes are computed automatically
  • This is huge! We no longer need to hand-code gradients; we can just define our model and optimize it

Networks are specified with (essentially) two kinds of abstract objects: Modules and Criteria

  • Modules (recursively) define a transformation from an input to an output; you can think of them as functions
  • A Criterion calculates a loss based on an input (typically computed by a module) and a target

Be sure to check the official 'nn' documentation: https://github.com/torch/nn

This tutorial (which partially inspired the below) may also be useful:

2. Modules

$\newcommand{\reals}{\mathbb{R}}$ $\newcommand{\boldx}{\mathbf{x}}$ $\newcommand{\boldw}{\mathbf{w}}$

We will use nn.CMul as our first example of a module.

f = nn.CMul(size)

creates a module that computes the function $f: \reals^{size} \rightarrow \reals^{size}$ defined by $f(\boldx) = \boldx \odot \boldw$, where $\odot$ is elementwise multiplication. $\boldw$ are the function's parameters, which 'nn' will automatically initialize to something reasonable.

2a. The Forward Pass

In [2]:
x = torch.range(1, 5) -- will input this into the module

f = nn.CMul(x:size()) -- create the module
-- let's see what f's parameters were initialized to. ('nn' always inits to something reasonable)
print(f.weight) -- N.B. The ability to introspect each 'nn' module is what makes it often more convenient than Theano
print()

-- to apply f to an input x we call f:forward(x)
print(f:forward(x))
Out[2]:
-0.3180
-0.0484
 0.1885
 0.4165
 0.3364
[torch.DoubleTensor of size 5]


-0.3180
-0.0968
 0.5655
 1.6661
 1.6822
[torch.DoubleTensor of size 5]

In [3]:
-- modules are stateful; they store their parameters (if any) and also their last output
print(f.output) -- N.B. every module has an 'output' member
Out[3]:
-0.3180
-0.0968
 0.5655
 1.6661
 1.6822
[torch.DoubleTensor of size 5]

Let's create another simple module.

g = nn.Sum(j)

creates a module computing the function $g: \reals^{D_1 \times \ldots \times \, D_j \, \times \ldots \times \, D_M} \rightarrow \reals^{D_1 \times \ldots \times \, D_{j-1} \, \times \, D_{j+1} \, \times \ldots \times \, D_M}$ that sums the input over dimension $j$ (thus decreasing the number of dimensions by 1).

In [4]:
g = nn.Sum(1) -- sum over dimension 1
print(g:forward(x))
Out[4]:
 15
[torch.DoubleTensor of size 1]
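For a multi-dimensional input, the drop in dimension is easier to see. A quick sketch (the tensor M is our own example):

M = torch.ones(2, 3)
print(nn.Sum(1):forward(M)) -- sums out dimension 1: a length-3 vector of 2s
print(nn.Sum(2):forward(M)) -- sums out dimension 2: a length-2 vector of 3s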

2b. Batching

Most modules allow batching of inputs along the first dimension. That is, if your module expects inputs $x \in \reals^{size}$, you can give it an input $X \in \reals^{N \times size}$, and it will apply itself to each row $x$ of $X$.

In [5]:
-- let's batch calls to f
X = x:view(1,5):expand(3, 5) -- here, N = 3
print(f:forward(X))

-- whenever you can, you should batch; it'll be much faster
Out[5]:
-0.3180 -0.0968  0.5655  1.6661  1.6822
-0.3180 -0.0968  0.5655  1.6661  1.6822
-0.3180 -0.0968  0.5655  1.6661  1.6822
[torch.DoubleTensor of size 3x5]

2c. Container Modules

Individual modules can be combined using 'Container' modules to compute more complicated functions. For instance, the modules g and f can be composed to compute $g(f(\cdot))$ using

nn.Sequential()
In [6]:
h = nn.Sequential() -- this module computes the function defined by composing its child modules' functions in order
h:add(f) -- add the module f as h's first child
h:add(g) -- add the module g as h's second child
print(h:forward(x)) -- computes g(f(x)) = sum_i [x \odot w]_i, where \odot is elementwise multiplication
Out[6]:
 3.4990
[torch.DoubleTensor of size 1]

In [7]:
-- though nn.Sequential is the container you'll use most (at least early in the course), there are others.
-- nn.Concat(j) is a container that computes the function defined by applying each of its child modules to a single
-- input, and then concatenating the respective outputs along dimension j
cat = nn.Concat(1) -- concatenate outputs along 1st dimension
cat:add(f)
cat:add(g)
print(cat:forward(x))
Out[7]:
 -0.3180
 -0.0968
  0.5655
  1.6661
  1.6822
 15.0000
[torch.DoubleTensor of size 6]

In [8]:
-- You can print a module to see its contents
print(h)
Out[8]:
nn.Sequential {
  [input -> (1) -> (2) -> output]
  (1): nn.CMul
  (2): nn.Sum
}
{
  gradInput : DoubleTensor - empty
  modules : 
    {
      1 : 
        nn.CMul
        {
          output : DoubleTensor - size: 5
          gradInput : DoubleTensor - empty
          _output : DoubleTensor - size: 5
          _repeat : DoubleTensor - empty
          _expand : DoubleTensor - size: 3x5
          gradWeight : DoubleTensor - size: 5
          _weight : DoubleTensor - size: 5
          size : LongStorage - size: 1
          weight : DoubleTensor - size: 5
        }
      2 : 
        nn.Sum
        {
          gradInput : DoubleTensor - empty
          dimension : 1
          output : DoubleTensor - size: 1
        }
    }
  output : DoubleTensor - size: 1
}

In [9]:
-- to access the children of containers you can use :get(i) or index into the list of children stored in .modules
print(h:get(1))
print(h.modules[1])
Out[9]:
nn.CMul
{
  output : DoubleTensor - size: 5
  gradInput : DoubleTensor - empty
  _output : DoubleTensor - size: 5
  _repeat : DoubleTensor - empty
  _expand : DoubleTensor - size: 3x5
  gradWeight : DoubleTensor - size: 5
  _weight : DoubleTensor - size: 5
  size : LongStorage - size: 1
  weight : DoubleTensor - size: 5
}
nn.CMul
{
  output : DoubleTensor - size: 5
  gradInput : DoubleTensor - empty
  _output : DoubleTensor - size: 5
  _repeat : DoubleTensor - empty
  _expand : DoubleTensor - size: 3x5
  gradWeight : DoubleTensor - size: 5
  _weight : DoubleTensor - size: 5
  size : LongStorage - size: 1
  weight : DoubleTensor - size: 5
}

2d. The Backward Pass

$\newcommand{\boldz}{\mathbf{z}}$ $\newcommand{\btheta}{\boldsymbol{\theta}}$

Suppose we have a module computing a function $h$ that participates in the definition of a loss function $L$. For a particular input $\boldx \in \reals^n$ let $\boldz \in \reals^m$ be defined by $\boldz = h(\boldx)$, which allows us to write our loss function as $L(\boldz)$. By the (multivariate) chain rule, the gradient of $L$ wrt $x_i$ is

\begin{align*} \frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial z_j} \frac{\partial z_j}{\partial x_i} \end{align*}

Assuming $L$ returns a scalar, we can rewrite the above for the entire $\boldx$ as

\begin{align*} \frac{\partial L}{\partial \boldx} = \left(\frac{\partial L}{\partial \boldz}\right)^T \frac{\partial \boldz}{\partial \boldx}, \end{align*}

where $\frac{\partial \boldz}{\partial \boldx}$ is the Jacobian, which lives in $\reals^{m \times n}$.

Each 'nn' module knows how to (implicitly) compute $\frac{\partial \boldz}{\partial \boldx}$ -- the gradient of its output wrt its input -- and so can compute $\frac{\partial L}{\partial \boldx}$ if it is also handed $\frac{\partial L}{\partial \boldz}$. In just the same way, if a module has parameters $\btheta$, it knows how to calculate $\frac{\partial \boldz}{\partial \btheta}$, and can therefore calculate $\frac{\partial L}{\partial \btheta}$ if it is handed $\frac{\partial L}{\partial \boldz}$.

It's very important to know the 'nn' terminology for these gradients:

  • $\frac{\partial L}{\partial \boldx}$ is called 'gradInput' in nn; it's the gradient of the loss wrt a module's input

  • $\frac{\partial L}{\partial \boldz}$ is called 'gradOutput' in nn; it's the gradient of the loss wrt a module's output

  • $\frac{\partial L}{\partial \btheta}$ is called either 'gradWeight' or 'gradBias' in nn; it's the gradient of the loss wrt a module's parameters

Given $\frac{\partial L}{\partial \boldz}$ (the gradOutput), an 'nn' module computes $\frac{\partial L}{\partial \boldx}$ with the :backward(input, gradOutput) function, and stores the result in its 'gradInput' member, as follows:

In [10]:
gradOut = torch.randn(1) -- let's make up a random gradOutput for dL/dz (of same dimension as output of h)
print(gradOut)

-- let's now compute dL/dx with our gradOut
-- N.B. you MUST call :forward() before :backward() (and provide the same input); note we called :forward() above 
h:backward(x, gradOut) 
print(h.gradInput) -- N.B. each module also must have a gradInput member
Out[10]:
 1.6834
[torch.DoubleTensor of size 1]

-0.5353
-0.0815
 0.3173
 0.7012
 0.5664
[torch.DoubleTensor of size 5]

Let's check the gradients :backward() computed. Recall $h = g(f(\boldx))$ = nn.Sequential():add(f):add(g). Because $h$ is an nn.Sequential (i.e., a composition), it first computes the gradient of the loss wrt $g$'s input, which is $f(\boldx)$: \begin{align} \frac{\partial L}{\partial f(\boldx)} = \frac{\partial L}{\partial h(\boldx)} \cdot \frac{\partial h(\boldx)}{\partial f(\boldx)} = \text{gradOut} \cdot g'(f(\boldx)) \end{align}

Since $g$ just sums, $g'(\boldx)_i = 1$ for each $i$, and so $\frac{\partial L}{\partial f(\boldx)} = \text{gradOut} \cdot$ torch.ones(x:size()).

In [11]:
print(g.gradInput)
assert((g.gradInput - torch.ones(x:size()):mul(gradOut[1])):abs():max() < 1e-10)
Out[11]:
 1.6834
 1.6834
 1.6834
 1.6834
 1.6834
[torch.DoubleTensor of size 5]

Now that we have $\frac{\partial L}{\partial f(\boldx)}$, we can calculate $\frac{\partial L}{\partial \boldx}$ as $\left(\frac{\partial L}{\partial f(\boldx)}\right)^T \frac{\partial f}{\partial \boldx}$. Since $f(\boldx)_i = w_i \cdot x_i$, we have that $\frac{\partial f_i}{\partial x_i} = w_i$ and $\frac{\partial f_i}{\partial x_j} = 0$ for $j \neq i$; that is, $\frac{\partial f}{\partial \boldx} = \mathrm{diag}(\boldw)$. Thus, $\frac{\partial L}{\partial \boldx}$ is g.gradInput$^T \mathrm{diag}(\boldw) =$ g.gradInput $\odot \boldw$.

In [12]:
assert((f.gradInput - torch.cmul(g.gradInput, f.weight)):abs():max() < 1e-10)

-- (Note that f.gradInput == h.gradInput, since h = g(f(x)))

In addition to computing $\frac{\partial L}{\partial \boldx}$, backward() also computes $\frac{\partial L}{\partial \btheta}$, where $\btheta$ are the module's parameters. Specifically, modules accumulate the gradients wrt their parameters in their 'gradWeight' and 'gradBias' members. So, let's redo the above example, this time paying attention to parameters.

In [13]:
-- since backward() accumulates (i.e., adds) gradients, we need to start by zeroing out gradWeight and gradBias
h:zeroGradParameters() -- N.B. calling zeroGradParameters() on a container recursively zeroes grads on children
h:backward(x, gradOut)
print(f.gradWeight)

-- let's check that gradient was correct, using a calculation similar to the one used above for dL/dx
assert((f.gradWeight - torch.cmul(g.gradInput, x)):abs():max() < 1e-10)
Out[13]:
 1.6834
 3.3668
 5.0503
 6.7337
 8.4171
[torch.DoubleTensor of size 5]
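Because :backward() accumulates, calling it twice without re-zeroing doubles the stored gradient. A quick sketch (ours) to see this explicitly:

h:zeroGradParameters()
h:forward(x)
h:backward(x, gradOut)
h:backward(x, gradOut) -- second call accumulates on top of the first
-- gradWeight should now be twice the single-call value
assert((f.gradWeight - torch.cmul(g.gradInput, x):mul(2)):abs():max() < 1e-10)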

2e. Module Internals

Now that we know about :forward() and :backward() let's get a more precise sense of how they work. You'll need to know this if you ever want to implement your own modules!

In [ ]:
-- Recall that Module is an abstract class. The (abstract) functions :forward() and :backward() are defined in terms
-- of 3 functions subclasses must implement: updateOutput(), updateGradInput(), accGradParameters()

-- The below code is from https://github.com/torch/nn/blob/master/Module.lua; the comments are mine
function Module:forward(input)
   return self:updateOutput(input) -- subclasses must implement updateOutput, which sets self.output
end

function Module:backward(input, gradOutput, scale)
   scale = scale or 1
   self:updateGradInput(input, gradOutput) -- subclasses must implement updateGradInput, which sets self.gradInput
   self:accGradParameters(input, gradOutput, scale) -- subclasses must add dL/d\theta to self.gradWeight etc
   return self.gradInput
end
In [ ]:
-- here are some very simplified versions of these 3 functions for the CMul module (with new comments),
-- adapted from https://github.com/torch/nn/blob/master/CMul.lua

-- N.B. CMul inherits from Module, and so has .output and .gradInput members; 
-- because it has parameters, it also has .weight and .gradWeight members 

function CMul:updateOutput(input)   
   self.output:resizeAs(input):copy(input) -- self.output = input 
   self.output:cmul(self.weight)           -- self.output = self.output .* self.weight
   return self.output
end

function CMul:updateGradInput(input, gradOutput)  
   self.gradInput:resizeAs(input):zero()   -- zero out our gradInput buffer
   self.gradInput:addcmul(1, self.weight, gradOutput) -- self.gradInput = gradOutput .* self.weight
   return self.gradInput
end

function CMul:accGradParameters(input, gradOutput, scale)
   scale = scale or 1
   -- don't zero out gradWeight, because we're accumulating!
   self.gradWeight:addcmul(scale, input, gradOutput)  -- self.gradWeight += scale * (input .* gradOutput)
end
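Putting these pieces together, here's a minimal sketch of implementing your own module: a hypothetical nn.MySquare (the class name is ours) that squares its input elementwise. It has no parameters, so only updateOutput and updateGradInput are needed:

local MySquare, parent = torch.class('nn.MySquare', 'nn.Module')

function MySquare:__init()
   parent.__init(self)
end

function MySquare:updateOutput(input)
   self.output:resizeAs(input):copy(input):cmul(input) -- self.output = input .* input
   return self.output
end

function MySquare:updateGradInput(input, gradOutput)
   -- d(x_i^2)/dx_i = 2 x_i, so gradInput = 2 * input .* gradOutput
   self.gradInput:resizeAs(input):copy(input):cmul(gradOutput):mul(2)
   return self.gradInput
end

sq = nn.MySquare()
print(sq:forward(torch.range(1, 3))) -- 1, 4, 9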

2f. Some Useful Modules with Parameters

In [14]:
-- In addition to nn.CMul, you will likely want to know about
lin = nn.Linear(x:size(1), 3) -- computes Wx + b, where W \in R^{3 x 5} and b \in R^3
print(lin:forward(x))
Out[14]:
-2.8031
 1.9383
-1.8525
[torch.DoubleTensor of size 3]

In [15]:
-- LookupTables will be extremely important for this course; they map indices to corresponding weight vectors
LT = nn.LookupTable(5, 3) -- maps indices (1 thru 5) to corresponding weight vectors, which live in R^3

-- let's look at a LookupTable's weights
print(LT.weight)

-- LookupTables take indices as input!
idxs = torch.LongTensor({1,2,5})
print(LT:forward(idxs)) -- extracts 1st, 2nd, and 5th rows of weights

-- can also batch input to a LookupTable, as follows
batchIdxs = torch.LongTensor({{1, 3}, {4, 5}, {2, 3}}) -- here, there are 3 examples, each associated with 2 idxs
print(LT:forward(batchIdxs))
Out[15]:
-0.3408  0.8594  1.2139
 0.1566 -0.5897  0.1788
-0.0315  0.1821 -0.7354
 0.1246  0.2621 -0.0320
 1.1030 -0.9798  0.3587
[torch.DoubleTensor of size 5x3]

-0.3408  0.8594  1.2139
 0.1566 -0.5897  0.1788
 1.1030 -0.9798  0.3587
[torch.DoubleTensor of size 3x3]

(1,.,.) = 
 -0.3408  0.8594  1.2139
 -0.0315  0.1821 -0.7354

(2,.,.) = 
  0.1246  0.2621 -0.0320
  1.1030 -0.9798  0.3587

(3,.,.) = 
  0.1566 -0.5897  0.1788
 -0.0315  0.1821 -0.7354
[torch.DoubleTensor of size 3x2x3]

In [16]:
-- nn.Add computes x + b, where b \in R^5 (it can also be configured to add a single scalar)
add = nn.Add(x:size()) 
print(add:forward(x))

-- there are many more (esp. convolutions, which we'll talk about later in the course)!
Out[16]:
 1.4369
 2.2328
 2.5575
 4.2737
 4.6302
[torch.DoubleTensor of size 5]

2g. Some Useful Modules Without Parameters

In [17]:
-- non-linearities/'transfer' functions
x = torch.randn(5)
nonlin1 = nn.Sigmoid()
nonlin2 = nn.LogSoftMax()
nonlin3 = nn.Tanh()
nonlin4 = nn.ReLU()

print(nonlin1:forward(x))
print(nonlin2:forward(x))
print(nonlin3:forward(x))
print(nonlin4:forward(x))

-- other mathematical operations
X = torch.randn(3, 2)
op1 = nn.Max(1, 2) -- maxes over dimension 1, expects 2d input
op2 = nn.Mean(2, 2) -- means over dimension 2, expects 2d input
op3 = nn.Abs() 

print(op1:forward(X))
print(op2:forward(X))
print(op3:forward(X))

-- there are also Modules that reshape or view their arguments; the one you'll use most often is nn.View,
-- which takes in the desired dimension sizes
print(nn.View(2,3):forward(X))

-- there are many more!
Out[17]:
 0.6790
 0.5264
 0.6779
 0.6240
 0.5938
[torch.DoubleTensor of size 5]

Out[17]:
-1.3853
-2.0289
-1.3905
-1.6280
-1.7545
[torch.DoubleTensor of size 5]

 0.6346
 0.1051
 0.6315
 0.4672
 0.3626
[torch.DoubleTensor of size 5]

 0.7491
 0.1055
 0.7440
 0.5065
 0.3799
[torch.DoubleTensor of size 5]

 0.9011
 2.5226
[torch.DoubleTensor of size 2]

 0.4434
 0.2421
 1.2315
[torch.DoubleTensor of size 3]

 0.9011  0.0143
 0.8167  1.3009
 0.0597  2.5226
[torch.DoubleTensor of size 3x2]

Out[17]:
 0.9011 -0.0143 -0.8167
 1.3009 -0.0597  2.5226
[torch.DoubleTensor of size 2x3]

2h. Advanced Containers/Table-Layers

All the containers (and other modules) we've seen so far take in single Tensors as arguments. This won't be sufficient if we want functions of multiple inputs (especially if they're of different sizes or types).

As a motivating example, suppose we want to make a Linear-like layer over both sparse and dense features. That is, we want to compute

\begin{align*} \left[ \mathbf{W}_o \; \mathbf{W}_d \right] \begin{bmatrix} \boldx_o \\ \boldx_d \end{bmatrix} + \mathbf{b}, \end{align*}

where matrices $\mathbf{W}_o$ and $\mathbf{W}_d$ are concatenated horizontally, and a one-hot vector $\boldx_o$ is stacked on top of a dense vector $\boldx_d$ (and $\mathbf{b}$ is a bias). Note that the above is equivalent to $\mathbf{W}_o \boldx_o + \mathbf{W}_d \boldx_d + \mathbf{b}$. Moreover, since we know that $\mathbf{W}_o \boldx_o$ is equivalent to a lookup in a LookupTable, we can do the following:

In [18]:
D_o, D_d, D_h = 5, 3, 2 -- width of W_o, width of W_d, height of both W_o and W_d
x_o = torch.LongTensor({2}) -- index equivalent of [0 1 0 0 0]
x_d = torch.randn(1, D_d)

-- our first example of a Table layer/container
par = nn.ParallelTable() -- takes a TABLE of inputs, applies i'th child to i'th input, and returns a table
par:add(nn.LookupTable(D_o, D_h)) -- first child
par:add(nn.Linear(D_d, D_h)) -- second child

-- this parallel table produces a table of 2 1xD_h tensors corresponding to W_o x_o and W_d x_d + b resp.
print(par:forward({x_o, x_d}))
Out[18]:
{
  1 : DoubleTensor - size: 1x2
  2 : DoubleTensor - size: 1x2
}
In [19]:
-- to get our full linear transformation, we need to add the two tables. 
-- as usual, to compose functions in order we use nn.Sequential
spAndDenseLinear = nn.Sequential()
spAndDenseLinear:add(par)
spAndDenseLinear:add(nn.CAddTable()) -- CAddTable adds its incoming tables

-- let's look at spAndDenseLinear
print(spAndDenseLinear)
print()
Out[19]:
nn.Sequential {
  [input -> (1) -> (2) -> output]
  (1): nn.ParallelTable {
    input
      |`-> (1): nn.LookupTable
      |`-> (2): nn.Linear(3 -> 2)
       ... -> output
  }
  (2): nn.CAddTable
}
{
  gradInput : table: 0x40e50d38
  modules : 
    {
      1 : 
        nn.ParallelTable {
          input
            |`-> (1): nn.LookupTable
            |`-> (2): nn.Linear(3 -> 2)
             ... -> output
        }
        {
          gradInput : table: 0x40e50d38
          modules : 
            {
              1 : 
                nn.LookupTable
                {
                  copiedInput : false
                  weight : DoubleTensor - size: 5x2
                  shouldScaleGradByFreq : false
                  gradWeight : DoubleTensor - size: 5x2
                  gradInput : DoubleTensor - empty
                  _count : IntTensor - empty
                  _input : LongTensor - empty
                  output : DoubleTensor - size: 1x2
                }
              2 : 
                nn.Linear(3 -> 2)
                {
                  gradBias : DoubleTensor - size: 2
                  weight : DoubleTensor - size: 2x3
                  bias : DoubleTensor - size: 2
                  gradInput : DoubleTensor - empty
                  addBuffer : DoubleTensor - size: 1
                  gradWeight : DoubleTensor - size: 2x3
                  output : DoubleTensor - size: 1x2
                }
            }
          output : 
            {
              1 : DoubleTensor - size: 1x2
              2 : DoubleTensor - size: 1x2
            }
        }
      2 : 
        nn.CAddTable
        {
          gradInput : table: 0x41759cc8
          output : DoubleTensor - empty
        }
    }
  output : DoubleTensor - empty
}

In [20]:
-- finally, let's use spAndDenseLinear to compute W_o x_o + W_d x_d + b
print(spAndDenseLinear:forward({x_o, x_d}))
Out[20]:
 1.2212 -2.0951
[torch.DoubleTensor of size 1x2]


Note that table containers/layers allow networks to take tables as input and produce them as output. Here we show some more layers that are useful when dealing with table inputs.

In [21]:
t = {torch.randn(1, 3), torch.randn(1, 3), torch.randn(1, 2)}

-- JoinTable(dim, nDims) makes a tensor from a table of tensors (each with nDims dimensions) by concatenating along dim
print(nn.JoinTable(2, 2):forward(t)) 
print()
-- NarrowTable(offset, len) outputs the subtable of len entries starting at entry offset
print(nn.NarrowTable(1, 2):forward(t))
Out[21]:
-0.6461  2.0589 -0.7707  0.0129 -0.6880  0.9008  0.0479  1.5202
[torch.DoubleTensor of size 1x8]


{
  1 : DoubleTensor - size: 1x3
  2 : DoubleTensor - size: 1x3
}
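Going the other direction, nn.SplitTable turns a tensor into a table of tensors; a quick sketch (ours):

-- SplitTable(dim) splits its input tensor along dimension dim into a table
print(nn.SplitTable(1):forward(torch.randn(2, 3))) -- a table of two length-3 tensors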

3. Criteria

Criterion objects are used to represent loss functions. They are similar to modules in that you can call :forward() and :backward() on them, and in that they have .output and .gradInput members. The major difference is that :forward() takes 2 arguments, namely, the predictions/scores and the targets/true labels.

In [22]:
mse = nn.MSECriterion() -- mean squared error criterion, often used for (scalar) regression loss
y = torch.randn(3) -- scalar targets
yhat = torch.zeros(3) -- scalar predictions
print(mse:forward(yhat, y)) -- returns MSE = 1/n sum_i (yhat_i - y_i)^2
print()

-- can remove 1/n factor as follows
mse.sizeAverage = false
print(mse:forward(yhat, y))
print()

-- get the gradient of the loss wrt the predictions using :backward()
dLdyhat = mse:backward(yhat, y)
print(dLdyhat)
print()

-- we'll generally pass dLdyhat as the gradOutput when calling :backward() on the network that computed yhat
Out[22]:
1.8225688302461	

5.4677064907383	

 2.3650
-1.6119
 3.6986
[torch.DoubleTensor of size 3]


In [ ]:
-- here's a classification criterion you're likely to use:
nllcrit = nn.ClassNLLCriterion() -- log loss for multiclass classification; expects log-probabilities and true class

Z = torch.randn(2, 3) -- 3-class classification scores for 2 examples
Yhat = nn.LogSoftMax():forward(Z) -- make log probabilities
Y = torch.Tensor({3,1}) -- true classes

print(nllcrit:forward(Yhat, Y))
print()
print(nllcrit:backward(Yhat, Y)) -- N.B. ClassNLLCriterion (by default) divides by numExamples, which affects grads
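As an aside, 'nn' also provides nn.CrossEntropyCriterion, which combines LogSoftMax and ClassNLLCriterion into a single criterion, so you can feed it raw scores directly. A quick sketch reusing Z and Y from above:

xent = nn.CrossEntropyCriterion()
print(xent:forward(Z, Y)) -- same loss as nllcrit:forward(nn.LogSoftMax():forward(Z), Y)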

4. Training

Here we'll show how to train a 1-layer neural network on a regression-style task.

In [3]:
-- let's generate some data
torch.manualSeed(287)
N = 5 -- num examples
F = 4 -- num features
X = torch.randn(N, F)
y = torch.mv(X, torch.randn(F)):pow(2):add(torch.randn(N))

-- let's create a 1-layer MLP
H = 3 -- hidden layer size
mlp = nn.Sequential()
mlp:add(nn.Linear(F,H))
mlp:add(nn.Tanh())
mlp:add(nn.Linear(H, 1))
-- note above equivalent to mlp = nn.Sequential():add(nn.Linear(F,H)):add(nn.Tanh()):add(nn.Linear(H,1))

-- now define our criterion
mse = nn.MSECriterion()

-- we can flatten (and then retrieve) all parameters (and gradParameters) of a module in the following way:
params, gradParams = mlp:getParameters() -- N.B. getParameters() moves around memory, and should only be called once!
eta = 0.01

-- now that we have our parameters flattened, we'll train with very simple SGD
-- note that all operations are batched across all of X
nEpochs = 5
for i = 1, nEpochs do
    -- zero out our gradients
    gradParams:zero()
    -- do forward pass
    preds = mlp:forward(X)
    -- get loss
    loss = mse:forward(preds, y)
    print("epoch " .. i .. ", loss: " .. loss)
    -- backprop
    dLdpreds = mse:backward(preds, y) -- gradients of loss wrt preds
    mlp:backward(X, dLdpreds)
    -- update params with sgd step
    params:add(-eta, gradParams)
end
Out[3]:
epoch 1, loss: 1.4847964912907	
Out[3]:
epoch 2, loss: 1.4471538022835	
Out[3]:
epoch 3, loss: 1.4123791149845	
Out[3]:
epoch 4, loss: 1.3801837665443	
Out[3]:
epoch 5, loss: 1.3503120157518	

Extracting parameters and gradParameters with :getParameters() is often useful, especially if you want to hand them to more sophisticated optimization algorithms (e.g., those in the 'optim' package). If you're just doing (S)GD, however, you can instead use the module function :updateParameters(). Here's the same example as above, using :updateParameters():

In [4]:
-- let's generate some data
torch.manualSeed(287)
N = 5 -- num examples
F = 4 -- num features
X = torch.randn(N, F)
y = torch.mv(X, torch.randn(F)):pow(2):add(torch.randn(N))

-- let's create a 1-layer MLP
H = 3 -- hidden layer size
mlp = nn.Sequential()
mlp:add(nn.Linear(F,H))
mlp:add(nn.Tanh())
mlp:add(nn.Linear(H, 1))
-- note above equivalent to mlp = nn.Sequential():add(nn.Linear(F,H)):add(nn.Tanh()):add(nn.Linear(H,1))

-- now define our criterion
mse = nn.MSECriterion()

eta = 0.01

-- now that we have our parameters flattened, we'll train with very simple SGD
-- note that all operations are batched across all of X
nEpochs = 5
for i = 1, nEpochs do
    -- zero out our gradients
    mlp:zeroGradParameters()
    -- do forward pass
    preds = mlp:forward(X)
    -- get loss
    loss = mse:forward(preds, y)
    print("epoch " .. i .. ", loss: " .. loss)
    -- backprop
    dLdpreds = mse:backward(preds, y) -- gradients of loss wrt preds
    mlp:backward(X, dLdpreds)
    -- update params with sgd step
    mlp:updateParameters(eta) -- computes parameters = parameters - eta * gradient
end
Out[4]:
epoch 1, loss: 1.4847964912907	
Out[4]:
epoch 2, loss: 1.4471538022835	
Out[4]:
epoch 3, loss: 1.4123791149845	
Out[4]:
epoch 4, loss: 1.3801837665443	
Out[4]:
epoch 5, loss: 1.3503120157518	
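If you later move to fancier optimizers, the flattened params/gradParams pair from :getParameters() is exactly what the 'optim' package expects. Here's a sketch of the same training loop with optim.sgd, assuming 'optim' is installed and reusing mlp, mse, X, y, eta, and nEpochs from above (feval and state are our own names):

require 'optim'

params, gradParams = mlp:getParameters() -- flatten the current mlp's parameters (only call this once!)
state = {learningRate = eta} -- optim keeps optimizer state (e.g., momentum buffers) in this table

-- feval returns the loss and the gradient wrt the flattened parameters
function feval(p)
   if p ~= params then params:copy(p) end
   gradParams:zero()
   local preds = mlp:forward(X)
   local loss = mse:forward(preds, y)
   mlp:backward(X, mse:backward(preds, y))
   return loss, gradParams
end

for i = 1, nEpochs do
   local _, losses = optim.sgd(feval, params, state)
   print("epoch " .. i .. ", loss: " .. losses[1])
end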

5. Common Bugs

If your code doesn't seem to be working, here are some things to check:

  • Gradient Tensors are zeroed out before each epoch
  • :backward() receives a gradOutput argument of the same dimension as the module's output
  • Batching happens only along first dimension
  • Used the correct sign in gradient update

Also worth noting: if you're doing anything vaguely complicated or non-standard with 'nn', you should always check your gradients with finite differences! A minimal sketch of such a check follows.
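Here's a minimal sketch of a centered finite-difference check on a CMul's gradWeight (all variable names are our own):

eps = 1e-6
net = nn.Sequential():add(nn.CMul(5)):add(nn.Sum(1))
crit = nn.MSECriterion()
input, target = torch.randn(5), torch.randn(1)

-- analytic gradient via backward
net:zeroGradParameters()
crit:forward(net:forward(input), target)
net:backward(input, crit:backward(net.output, target))

-- numeric gradient, one weight at a time
w, gw = net:get(1).weight, net:get(1).gradWeight
for i = 1, w:size(1) do
   local orig = w[i]
   w[i] = orig + eps
   local fplus = crit:forward(net:forward(input), target)
   w[i] = orig - eps
   local fminus = crit:forward(net:forward(input), target)
   w[i] = orig
   assert(math.abs((fplus - fminus) / (2 * eps) - gw[i]) < 1e-4)
end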

6. Exercise: Implement a Multi-class Hinge Loss Criterion

(Note that the MultiMarginCriterion in 'nn' isn't quite what we saw in class).

Recall

\begin{align*} L_{hinge}(\hat{\mathbf{y}}, \mathbf{y}) = \max\{0, 1 - (\hat{y}_c - \hat{y}_{c'})\}, \end{align*}

where $c$ is the true class and $c' = \mathrm{argmax}_{i \in \mathcal{C} - \{c\}} \hat{y}_i$.
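If you want a starting point, here's a possible skeleton (the class name is ours; the actual math is left to you):

local MultiClassHinge, parent = torch.class('nn.MultiClassHinge', 'nn.Criterion')

function MultiClassHinge:__init()
   parent.__init(self)
end

function MultiClassHinge:updateOutput(input, target)
   -- input: a vector of scores \hat{y}; target: the true class index c
   -- TODO: set self.output = max{0, 1 - (\hat{y}_c - \hat{y}_{c'})}
   return self.output
end

function MultiClassHinge:updateGradInput(input, target)
   self.gradInput:resizeAs(input):zero()
   -- TODO: fill in self.gradInput; note the loss is piecewise linear in the scores
   return self.gradInput
end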