Getting Started with PyTorch

In [1]:
__author__ = "Ignacio Cases"
__version__ = "CS224U, Stanford, Spring 2019"


Getting Started

PyTorch is a Python package designed to carry out scientific computation. We use PyTorch in a range of different environments: from developing models on our local machines, to large-scale training on big clusters, and even to performing inference on embedded, low-power systems. While similar in many respects to NumPy, PyTorch enables fast and efficient training of deep learning and reinforcement learning models not only on the CPU but also on a GPU or other ASICs (Application-Specific Integrated Circuits) for AI, such as Tensor Processing Units (TPUs).

Importing PyTorch

This tutorial assumes a working install of PyTorch using the nlu kernel, but the content applies to any regular install of PyTorch. If you don't have a working installation of PyTorch, please follow the instructions on the course repo.

To get started working with PyTorch, we simply import the torch module:

In [2]:
import torch

Side note: why not import pytorch? The name of the package is torch for historical reasons: torch is the original name of the ancestor of the PyTorch library, which got started back in 2002 as a C library with Lua scripting. It was only much later that the original torch was ported to Python. The PyTorch project decided to add the prefix Py to make clear that this library refers to the Python version, as it was confusing back then to know which torch one was referring to. All internal references to the library use just torch. It's possible that PyTorch gets renamed at some point, as the original torch is no longer maintained and there is no longer any confusion.

We can verify the version installed and whether or not we have a GPU-enabled PyTorch install by issuing

In [3]:
print("PyTorch version {}".format(torch.__version__))
print("GPU-enabled installation? {}".format(torch.cuda.is_available()))
PyTorch version 1.0.1.post2
GPU-enabled installation? True

PyTorch has good documentation, but it can take some time to get used to the structure of the package; we really recommend that you familiarize yourself with it. We will also make use of other imports:

In [4]:
import numpy as np


Tensors are collections of numbers represented as an array, and are the basic building blocks in PyTorch.

You are probably already familiar with several types of tensors:

  • A scalar, a single number, is a zero-th order tensor.

  • A column vector $v$ of dimensionality $d_c \times 1$ is another (very special) type of tensor of order 1.

  • A row vector $x$ of dimensionality $1 \times d_r$ is another (also very special) type of tensor of order 1.

  • A matrix $A$ of dimensionality $d_r \times d_c$ is yet another type of tensor of order 2.

  • A cube $T$ of dimensionality $d_r \times d_c \times d_d$ is also a tensor of order 3. An image can be thought of as a tensor of order 3, where the first two dimensions encode pixel luminosity for a given channel that is specified by the third dimension (the color plane).

For our purposes, tensors are the fundamental blocks that carry information in our mathematical model, and they are composed using several operations creating a mathematical graph in which information can flow (propagate) forward (functional application) and backwards (using the chain rule).

We have seen multidimensional arrays in NumPy: although they go by a different name, these NumPy objects are also a representation of tensors.

Side note: what is a tensor, really? Tensors are important mathematical objects with applications in multiple domains in Mathematics and Physics. The term tensor comes from the usage of these mathematical objects to describe the stretching of a volume of matter under tension. They are central objects of study in a subfield of Mathematics known as Differential Geometry, which deals with the geometry of continuous vector spaces. As a very high-level summary (and as a first approximation), tensors are defined as multi-linear "machines" that have a number of slots (their order, aka rank), taking a number of "column" vectors and "row" vectors to produce a scalar. For example, a tensor $\mathbf{A}$ (represented by a matrix with rows and columns that you could write on a sheet of paper) can be thought of as having two slots. So when $\mathbf{A}$ acts upon a column vector $\mathbf{v}$ and a row vector $\mathbf{x}$, it returns a scalar.

$\mathbf{A}(\mathbf{x}, \mathbf{v}) = s$

If $\mathbf{A}$ only acts on the column vector, for example, the result will be another column tensor $\mathbf{u}$ of one order less than the order of $\mathbf{A}$. Thus, letting $\mathbf{A}$ act on $\mathbf{v}$ is similar to "removing" one of its slots:

$\mathbf{u} = \mathbf{A}(\mathbf{v})$

The resulting $\mathbf{u}$ can later interact with another row vector to produce a scalar or be used in any other way.

This can be a very powerful way of thinking about tensors, as their slots can guide you when writing the code, especially given that PyTorch has a functional approach to modules where this view is very much highlighted. As we will see below, the simple equations above have a completely straightforward representation in the code. In the end, most of what our models will do is process the input using this type of functional application, so that we end up with a tensor output and a scalar value that measures how good our output is with respect to the real output value in the dataset.
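To make the "slots" picture concrete, here is a minimal sketch with small hand-picked values (the numbers are illustrative assumptions, not taken from the tutorial). Filling one slot of a matrix leaves a vector; filling both leaves a scalar.

```python
import torch

# A 2x2 matrix: a tensor with two slots
A = torch.tensor([[1., 2.],
                  [3., 4.]])
v = torch.tensor([[1.], [0.]])   # column vector, size [2, 1]
x = torch.tensor([[0., 1.]])     # row vector, size [1, 2]

# Filling only the column slot leaves a column vector: u = A(v)
u = torch.matmul(A, v)

# Filling both slots leaves a scalar: s = A(x, v)
s = torch.matmul(x, u)

print(u.size())    # torch.Size([2, 1])
print(s.item())    # 3.0
```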

Tensor Creation

Let's get started with tensors in PyTorch. The framework supports eight different types (Lapan 2018):

  • 3 float types (16-bit, 32-bit, 64-bit): torch.FloatTensor is the class name for the commonly used 32-bit tensor.
  • 5 integer types (signed 8-bit, unsigned 8-bit, 16-bit, 32-bit, 64-bit): common tensors of these types are the 8-bit unsigned tensor torch.ByteTensor and the 64-bit torch.LongTensor.

There are three fundamental ways to create tensors in PyTorch (Lapan 2018):

  • Call a tensor constructor of a given type, which will create a non-initialized tensor. So we then need to fill this tensor later to be able to use it.
  • Call a built-in method in the torch module that returns a tensor that is already initialized.
  • By using the PyTorch-NumPy bridge.

Calling the constructor

Let's first create a 2 x 3 dimensional tensor of the type float:

In [5]:
t = torch.FloatTensor(2, 3)
print(t, t.size())
tensor([[4.2460e+13, 4.5911e-41, 4.2460e+13],
        [4.5911e-41, 0.0000e+00, 0.0000e+00]]) torch.Size([2, 3])

Note that we specified the dimensions as the arguments to the constructor by passing directly the numbers -- and not a list or a tuple, which would have very different outcomes as we will see below! We can always inspect the size of the tensor using the size() method.
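As a small sketch of the difference between the two calling conventions, the snippet below passes the same two numbers first as separate arguments and then as a list (the variable names are illustrative):

```python
import torch

t_dims = torch.FloatTensor(2, 3)      # two integers: a 2x3 tensor, not initialized
t_data = torch.FloatTensor([2, 3])    # one list: a 1-D tensor holding the values 2. and 3.

print(t_dims.size())   # torch.Size([2, 3])
print(t_data)          # tensor([2., 3.])
```

In the first call the numbers are read as dimensions; in the second they are read as content.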

The constructor method allocates space in memory for this tensor. Note however that the tensor is non-initialized for our purposes in Deep Learning. In order to initialize it we need to call any of the tensor initialization methods built-in in the basic tensor types. For example, the tensor we just created has a built-in method zero_():

In [6]:
t.zero_()
tensor([[0., 0., 0.],
        [0., 0., 0.]])

The underscore after the method name is important: it means that the operation happens in place, that is, the returned object is the same object but now with different content. A very handy way to construct a tensor with the constructor arises when the content we want to put in the tensor is already available as a Python iterable: in this case we just pass it as the argument to the constructor:

In [7]:
torch.FloatTensor([[1, 2, 3], [4, 5, 6]])
tensor([[1., 2., 3.],
        [4., 5., 6.]])

Calling a method in the torch module

A very convenient way to create tensors, in addition to using the constructor method, is to use one of the multiple methods provided in the torch module. In particular, the tensor method allows us to pass a number or iterable as the argument to get the appropriately typed tensor:

In [8]:
tl = torch.tensor([1, 2, 3])
t = torch.tensor([1., 2., 3.])
print("A 64-bit integer tensor: {}, {}".format(tl, tl.type()))
print("A 32-bit float tensor: {}, {}".format(t, t.type()))
A 64-bit integer tensor: tensor([1, 2, 3]), torch.LongTensor
A 32-bit float tensor: tensor([1., 2., 3.]), torch.FloatTensor

We can create a similar 2x3 tensor to the one above by using the torch.zeros() method passing a sequence of dimensions to it:

In [9]:
t = torch.zeros(2, 3)
print(t)
tensor([[0., 0., 0.],
        [0., 0., 0.]])

We can find a large variety of methods to create tensors there. We list some useful ones:

In [10]:
t_zeros = torch.zeros_like(t)            # zeros_like returns a new tensor
t_ones = torch.ones(2, 3)                # creates a tensor with 1s
t_fives = torch.empty(2, 3).fill_(5)     # creates a non-initialized tensor and fills it with 5
t_random = torch.rand(2, 3)              # creates a uniform random tensor
t_normal = torch.randn(2, 3)             # creates a normal random tensor with the specified dimensions

# print each tensor in turn
for t_i in (t_zeros, t_ones, t_fives, t_random, t_normal):
    print(t_i)

tensor([[0., 0., 0.],
        [0., 0., 0.]])
tensor([[1., 1., 1.],
        [1., 1., 1.]])
tensor([[5., 5., 5.],
        [5., 5., 5.]])
tensor([[0.7590, 0.1273, 0.4451],
        [0.6683, 0.0196, 0.0866]])
tensor([[-1.7895, -0.7535, -0.4807],
        [ 0.5964,  1.4192, -0.5725]])

We now see two important paradigms emerging in PyTorch: the imperative approach to performing operations, using in-place methods, stands in marked contrast with the functional approach, where the returned object is a copy of the original object. Both paradigms have their specific use cases, as we will see below. The rule of thumb is that in-place methods are faster and, in general, don't require extra memory allocation, but they can be tricky to reason about (keep this in mind regarding the computational graph that we will see below). Functional methods make the code referentially transparent, a highly desirable property that makes it easier to understand the underlying math, but we rely on the efficiency of the implementation:

In [11]:
t1 = torch.clone(t)      # creates a new copy of the tensor that is still linked to the computational graph (see below)
assert id(t) != id(t1), 'Functional methods create a new copy of the tensor'

# To create a new _independent_ copy, we do need to detach from the graph
t1 = torch.clone(t).detach()
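The contrast between the two paradigms can be sketched with the add() / add_() pair, chosen here only as an illustration: the functional form returns a new tensor, the in-place form modifies and returns the original one.

```python
import torch

t = torch.ones(2, 3)

t_new = t.add(1)      # functional: returns a new tensor, t is unchanged
t_same = t.add_(1)    # in place: modifies t and returns it

print(t_new is t)     # False
print(t_same is t)    # True
print(t)              # now filled with 2s
```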

Using the PyTorch-NumPy bridge

A quite useful feature of PyTorch is its almost seamless integration with NumPy, which allows us to perform operations in NumPy and to interact from PyTorch with the large number of libraries built on NumPy. Converting a NumPy multi-dimensional array into a PyTorch tensor is very simple: we only need to call the tensor method with the NumPy object as the argument:

In [12]:
# Create a new multi-dimensional array in NumPy (the default dtype for floats is np.float64)
a = np.array([1., 2., 3.])

# Convert the array to a torch tensor
t = torch.tensor(a)

print("NumPy array: {}, type: {}".format(a, a.dtype))
print("Torch tensor: {}, type: {}".format(t, t.dtype))
NumPy array: [1. 2. 3.], type: float64
Torch tensor: tensor([1., 2., 3.], dtype=torch.float64), type: torch.float64

We can convert a PyTorch tensor into a NumPy array also seamlessly:

In [13]:
t.numpy()
array([1., 2., 3.])

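Side note: why not torch.from_numpy(a)? Both methods work, but they behave differently: from_numpy() returns a tensor that shares its memory with the NumPy array, so a change to one is visible in the other, whereas tensor() always copies the data and additionally accepts dtype and device arguments. The sharing can be a little bit quirky, so I recommend tensor() as the default in PyTorch >= 0.4, keeping from_numpy() for cases where avoiding the copy matters.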
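One behavior worth knowing when choosing between the two methods: torch.from_numpy() returns a tensor that shares memory with the array, whereas torch.tensor() copies the data. A minimal sketch (variable names are illustrative):

```python
import numpy as np
import torch

a = np.array([1., 2., 3.])
shared = torch.from_numpy(a)   # shares memory with the NumPy array
copied = torch.tensor(a)       # copies the data

a[0] = 10.                     # modify the NumPy array in place

print(shared)   # first element is now 10.
print(copied)   # first element is still 1.
```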


Indexing works as in NumPy:

In [14]:
t = torch.randn(2, 3)
t[:, 0]   # for example, select the first column (a tensor with 2 elements)
tensor([-0.7830,  1.1622])

PyTorch also supports indexing using long tensors, for example:

In [15]:
t = torch.randn(5, 6)
print(t)
i = torch.tensor([1, 3])
j = torch.tensor([4, 5])
print(t[i])                          # selects rows 1 and 3
print(t[i, j])                       # selects (1, 4) and (3, 5)
tensor([[-0.5929,  0.0931, -0.3181,  1.0272, -0.3327,  0.3270],
        [ 0.1875, -0.8570, -0.6022, -0.3721, -0.1816, -1.3812],
        [ 0.5273, -0.9770,  1.2352,  0.7568, -0.0677, -1.0433],
        [-0.3046,  0.4589,  0.1333, -0.3969,  0.0188,  0.1917],
        [ 0.9501,  0.9675,  1.9239,  1.1949,  1.2277,  0.0470]])
tensor([[ 0.1875, -0.8570, -0.6022, -0.3721, -0.1816, -1.3812],
        [-0.3046,  0.4589,  0.1333, -0.3969,  0.0188,  0.1917]])
tensor([-0.1816,  0.1917])

Type conversion

Each tensor has a set of convenient methods to convert types. For example, if we want to convert the tensor above to a 32-bit float tensor, we use the method .float():

In [16]:
t = t.float()   # converts to 32-bit float
print(t)
t = t.double()  # converts to 64-bit float
print(t)
t = t.byte()    # converts to unsigned 8-bit integer
print(t)
tensor([[-0.5929,  0.0931, -0.3181,  1.0272, -0.3327,  0.3270],
        [ 0.1875, -0.8570, -0.6022, -0.3721, -0.1816, -1.3812],
        [ 0.5273, -0.9770,  1.2352,  0.7568, -0.0677, -1.0433],
        [-0.3046,  0.4589,  0.1333, -0.3969,  0.0188,  0.1917],
        [ 0.9501,  0.9675,  1.9239,  1.1949,  1.2277,  0.0470]])
tensor([[-0.5929,  0.0931, -0.3181,  1.0272, -0.3327,  0.3270],
        [ 0.1875, -0.8570, -0.6022, -0.3721, -0.1816, -1.3812],
        [ 0.5273, -0.9770,  1.2352,  0.7568, -0.0677, -1.0433],
        [-0.3046,  0.4589,  0.1333, -0.3969,  0.0188,  0.1917],
        [ 0.9501,  0.9675,  1.9239,  1.1949,  1.2277,  0.0470]], dtype=torch.float64)
tensor([[  0,   0,   0,   1,   0,   0],
        [  0,   0,   0,   0,   0, 255],
        [  0,   0,   1,   0,   0, 255],
        [  0,   0,   0,   0,   0,   0],
        [  0,   0,   1,   1,   1,   0]], dtype=torch.uint8)

Operations on Tensors

Now that we know how to create tensors, let's create some of the fundamental tensors and see some common operations on them:

In [17]:
# Scalars 
s = torch.tensor(42) # creates a tensor with a scalar (zero-th order tensor, i.e. just a number)

Tip: in newer PyTorch versions, a very convenient way to access the value of a scalar tensor is .item():

In [18]:
s.item()   # returns the Python number 42

Let's see higher-order tensors -- remember we can always inspect the dimensionality of a tensor using the .size() method:

In [19]:
# Row vector
x = torch.randn(1,3)
print("Row vector {} with size {}".format(x, x.size()))

# Column vector
v = torch.randn(3,1)
print("Column vector {} with size {}".format(v, v.size()))

# Matrix
A = torch.randn(3, 3)
print("Matrix {} with size {}".format(A, A.size()))
Row vector tensor([[ 0.9739, -1.7025,  0.1483]]) with size torch.Size([1, 3])
Column vector tensor([[0.6065],
        [1.0098]]) with size torch.Size([3, 1])
Matrix tensor([[-0.1813, -0.5609,  0.4626],
        [-1.0871, -0.4376, -0.9243],
        [ 0.5843, -0.7621, -0.6608]]) with size torch.Size([3, 3])

A common operation is matrix-vector multiplication (and, in general, tensor-tensor multiplication). For example, the affine transformation $\mathbf{A}\mathbf{v} + \mathbf{b}$ is computed as follows:

In [20]:
u = torch.matmul(A, v)
b = torch.randn(3,1)
y = u + b                    # we can also do torch.add(u, b)

where we retrieve the expected result (a column vector of dimensions 3x1). We of course can compose operations:

In [21]:
s = torch.matmul(x, torch.matmul(A, v))

There are many functions implemented for every tensor, and we encourage you to study the documentation. Some of the most common ones:

In [22]:
# common tensor methods (they also have the counterpart in the torch package, e.g. as torch.sum(t))
t = torch.randn(2,3)
t.t()                   # transpose
t.numel()               # number of elements in tensor
t.nonzero()             # indices of non-zero elements
t.view(-1, 2)           # reorganizes the tensor to these dimensions
t.squeeze()             # removes size 1 dimensions
t.unsqueeze(0)          # inserts a dimension

# operations in the package
torch.arange(0, 10)     # tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
torch.eye(3, 3)         # creates a 3x3 matrix with 1s in the diagonal (identity in this case)
t = torch.arange(0, 3)
torch.cat((t, t))       # tensor([0, 1, 2, 0, 1, 2])
torch.stack((t, t))     # tensor([[0, 1, 2],
                        #         [0, 1, 2]])
tensor([[0, 1, 2],
        [0, 1, 2]])
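Since the reshaping methods are easier to follow on a tensor whose content is known, here is a short sketch that applies some of them to a small integer tensor (the values are chosen only for illustration):

```python
import torch

t = torch.arange(0, 6).view(2, 3)      # tensor([[0, 1, 2], [3, 4, 5]])

print(t.t().size())                     # torch.Size([3, 2])
print(t.view(-1, 2))                    # rows are [0, 1], [2, 3], [4, 5]
print(t.unsqueeze(0).size())            # torch.Size([1, 2, 3])
print(t.unsqueeze(0).squeeze().size())  # torch.Size([2, 3])
print(t.numel())                        # 6
```

Note that view(-1, 2) infers the first dimension from the number of elements, and that it rearranges the elements in row-major order rather than transposing.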

GPU computation

Deep Learning frameworks take advantage of the powerful computational capabilities of modern graphics processing units (GPUs). GPUs were originally designed to perform the operations that are frequent in graphics, such as linear algebra operations, very efficiently, which makes them ideal for our purposes. PyTorch makes it very easy to use the GPU: the common scenario is to tell the framework that we want to instantiate a tensor with a type that makes it a GPU tensor, or to move a given CPU tensor to the GPU. All the tensors that we have seen above are CPU tensors, and PyTorch has counterparts for GPU tensors in the torch.cuda module. Let's see how this works.

A common way to explicitly declare the tensor type as a GPU tensor is through the use of the constructor method for tensor creation inside the torch.cuda module:

In [23]:
t_gpu = torch.cuda.FloatTensor(3, 3)   # creation of a GPU tensor
t_gpu.zero_()                          # initialization to zero
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]], device='cuda:0')


However, a more common approach that gives us flexibility is through the use of devices. A device in PyTorch refers to either the CPU (indicated by the string "cpu") or one of the possible GPU cards in the machine (indicated by the string "cuda:$n$", where $n$ is the index of the card). Let's create a random gaussian matrix using a method from the torch package, and set explicitly the computational device to be the GPU by specifying the device to be cuda:0, our first GPU card in our machine (this code will fail if you don't have a GPU, but we will work around that below):

In [24]:
t_gpu = torch.randn(3, 3, device="cuda:0")
tensor([[ 0.5961,  0.3101, -0.6615],
        [ 1.1299,  1.2280, -1.0802],
        [ 0.5733,  0.8676, -0.6882]], device='cuda:0')

As you can notice, the tensor now has the explicit device set to be a CUDA device, not a CPU device. Let's now create a tensor in the CPU and move it to the GPU:

In [25]:
t = torch.randn(3, 3)   # we could also state explicitly the device to be the CPU with torch.randn(3,3,device="cpu")
tensor([[-1.0262,  0.2573, -1.2256],
        [-0.5341, -1.2264,  0.3200],
        [ 0.9489, -0.6399, -0.8682]])

Note that in this case the device is the CPU, but PyTorch does not explicitly say so, given that this is the default behavior. To copy the tensor to the GPU we use the .to() method that every tensor implements, passing the device as an argument. This method creates a copy in the specified device or, if the tensor already resides in that device, it returns the original tensor (Lapan 2018):

In [26]:
t_gpu = t.to("cuda:0")  # copies the tensor from CPU to GPU
# note that if we now do t_gpu.to("cuda:0") it will return the same tensor without doing anything else, as this tensor already resides on the GPU
tensor([[-1.0262,  0.2573, -1.2256],
        [-0.5341, -1.2264,  0.3200],
        [ 0.9489, -0.6399, -0.8682]], device='cuda:0')

Tip: When we program PyTorch models, we will have to specify the device in several places (not so many, but definitely more than once). A good practice that keeps the implementation consistent and makes the code more portable is to declare a device variable early in the code by querying the framework for an available GPU. We can do this by writing

In [27]:
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

We can then use device as an argument of the .to() method in the rest of our code:

In [28]:
t.to(device)   # moves t to the device (this code will **not** fail if the local machine does not have access to a GPU)
tensor([[-1.0262,  0.2573, -1.2256],
        [-0.5341, -1.2264,  0.3200],
        [ 0.9489, -0.6399, -0.8682]], device='cuda:0')
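Putting the pieces together, a device-agnostic computation might look like the sketch below. It runs on the GPU when one is available and on the CPU otherwise; the tensor names are illustrative.

```python
import torch

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

A = torch.randn(3, 3, device=device)   # created directly on the device
v = torch.randn(3, 1).to(device)       # created on the CPU, then moved

u = torch.matmul(A, v)                 # operands must live on the same device
print(u.device)
print(u.cpu().numpy().shape)           # move back to the CPU before converting to NumPy
```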

Side note: having good GPU backend support is a critical aspect of a deep learning framework, so much so that some models depend crucially on performing computations on a GPU. Most frameworks, including PyTorch, only provide good support for GPUs manufactured by Nvidia. This is mostly due to the heavy investment this company has made in CUDA (Compute Unified Device Architecture), the underlying parallel computing platform that enables this type of scientific computing (and the reason for the device label), with specific implementations targeted at deep neural networks, such as cuDNN. Other GPU manufacturers, most notably AMD, are making efforts towards enabling ML computing on their cards, but their support is still partial.

Neural Network Foundations

Computing gradients is a crucial feature in deep learning, given that the training procedure of neural networks relies on optimization techniques that update the parameters of the model using the gradient of a scalar quantity -- the loss function. How is it possible to compute the derivatives? There are different methods, namely:

  • Symbolic Differentiation: given a symbolic expression, the software provides the derivative by performing symbolic transformations (e.g. Wolfram Alpha). The benefits are clear, but it is not always possible to compute an analytical expression.
  • Numerical Differentiation: computes the derivatives using expressions that are suitable for numerical evaluation, using finite-difference methods to several orders of approximation. A big drawback is that these methods are slow.
  • Automatic Differentiation: an approach in which a library adds to its set of functional primitives (a number of functions for which an implementation is available) an implementation of the derivative of each of these functions. Thus, if the library contains the function $\sin(x)$, it also implements the derivative of this function, $\frac{d}{dx}\sin(x) = \cos(x)$. Then, given a composition of functions, the library can compute the derivative with respect to a variable by successive application of the chain rule, a method that is known in deep learning as backpropagation.
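The difference between the last two methods can be sketched for the $\sin(x)$ example: below, the derivative at an arbitrarily chosen point is computed once with PyTorch's automatic differentiation and once with a central finite difference, and both are compared with the analytical value $\cos(x)$. The evaluation point and step size are assumptions made for illustration.

```python
import math
import torch

x0 = 0.5

# Automatic differentiation: autograd applies the stored derivative of sin
x = torch.tensor(x0, dtype=torch.float64, requires_grad=True)
y = torch.sin(x)
y.backward()
auto_grad = x.grad.item()

# Numerical differentiation: central finite difference with step h
h = 1e-5
numeric_grad = (math.sin(x0 + h) - math.sin(x0 - h)) / (2 * h)

print(auto_grad, numeric_grad, math.cos(x0))
```

The automatic result matches the analytical derivative up to floating-point precision, while the numerical one carries a small approximation error that depends on the step size.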

Automatic Differentiation

Modern deep learning libraries are capable of performing automatic differentiation, although the underlying implementation models result in very different user experiences. The two main approaches to computing the graph are static and dynamic processing (Lapan 2018):

  • Static graphs: the deep learning framework converts the computational graph into a static representation that cannot be modified. This allows the library developers to do very aggressive optimizations on this static graph ahead of computation time, pruning some areas and transforming others so that the final product is highly optimized and fast. The drawback is that some models can be really hard to implement with this approach. One of the libraries we use in the course, TensorFlow, uses static graphs. Having static graphs is part of the reason why TensorFlow has excellent support for sequence processing, which makes it very popular in NLP.

  • Dynamic graphs: the framework does not create a graph ahead of computation, but records the operations that are performed, which can be quite different for different inputs. When it is time to compute the gradients, it unrolls the graph and performs the computations. A major benefit of this approach is that implementing complex models can be easier in this paradigm (e.g. conditional computation is really easy). This flexibility comes at the expense of the major drawback of this approach: speed. Dynamic graphs cannot leverage the same level of ahead-of-time optimization, which makes them slower. PyTorch uses dynamic graphs as the underlying paradigm for gradient computation.
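As a small sketch of conditional computation with a dynamic graph, the hypothetical function f below takes a different branch depending on the value of its input, and the gradient follows whichever branch was actually executed:

```python
import torch

def f(x):
    # The branch taken depends on the value of the input,
    # so the recorded graph can differ from call to call.
    if x.item() > 0:
        return x ** 2
    return -3 * x

x_pos = torch.tensor(2.0, requires_grad=True)
f(x_pos).backward()
print(x_pos.grad)    # tensor(4.)  -> derivative of x**2 at x = 2

x_neg = torch.tensor(-1.0, requires_grad=True)
f(x_neg).backward()
print(x_neg.grad)    # tensor(-3.) -> derivative of -3x
```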

[Figure: simple graph to compute $y = wx + b$ (from Rao and McMahan 2019)]

PyTorch computes the graph using the Autograd system. Autograd records a graph when performing the forward pass (function application), keeping track of all the tensors defined as inputs. These are the leaves of the graph. The output tensors are the roots of the graph. By navigating this graph from root to leaves, the gradients are automatically computed using the chain rule. In summary,

  • Forward pass (the successive function application) goes from leaves to root. In PyTorch, we apply the model to its input, as in model(x).
  • Once the forward pass is completed, Autograd has recorded the graph and the backward pass (chain rule) can be done. We use the method .backward() on the root of the graph.
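The graph for $y = wx + b$ can be reproduced in a few lines. In this sketch the values of $w$, $x$ and $b$ are arbitrary choices; the three input tensors are the leaves, $y$ is the root, and the backward pass is triggered with .backward():

```python
import torch

# Leaves of the graph
w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# Forward pass: autograd records the operations
y = w * x + b

# Backward pass, started from the root
y.backward()

print(y.item())                      # 7.0
print(w.grad, x.grad, b.grad)        # tensor(3.) tensor(2.) tensor(1.)
print(w.is_leaf, y.is_leaf)          # True False
```

The gradients agree with the derivatives computed by hand: $\partial y / \partial w = x$, $\partial y / \partial x = w$ and $\partial y / \partial b = 1$.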


The base implementation for all neural network models in PyTorch is the class Module in the package torch.nn:

In [29]:
import torch.nn as nn

All our models subclass this base nn.Module class, which provides an interface to important methods used for constructing and working with our models, and which contains sensible initializations for our models. Modules can (and usually do) contain other modules.

Let's see a simple, custom implementation of a multi-layer feed forward network. In the example below, our simple mathematical model is

$\mathbf{y} = \mathbf{U}(f(\mathbf{W}(\mathbf{x})))$

where $f$ is a non-linear function (a ReLU). This model translates directly into a similar expression in PyTorch. To do that, we simply subclass nn.Module, register the two affine transformations and the non-linearity, and implement their composition within the forward method:

In [30]:
class MyCustomModule(nn.Module):
    def __init__(self, n_inputs, n_hidden, n_output_classes):
        super(MyCustomModule, self).__init__()         # call super to initialize the class above in the hierarchy
        self.W = nn.Linear(n_inputs, n_hidden)         # first affine transformation
        self.f = nn.ReLU()                             # non-linearity (here it is also a layer!)
        self.U = nn.Linear(n_hidden, n_output_classes) # final affine transformation
    def forward(self, x):
        y = self.U(self.f(self.W(x)))
        return y

Then, we can use our new module as follows:

In [31]:
# set the network's architectural parameters
n_inputs = 3
n_hidden= 4
n_output_classes = 2

# instantiate the model
model = MyCustomModule(n_inputs, n_hidden, n_output_classes)

# create a simple input tensor 
x = torch.FloatTensor([[0.3, 0.8, -0.4]]) # size is [1,3]: a mini-batch of one example, this example having dimension 3

# compute the model output by **applying** the input to the module
y = model(x)

# inspect the output
print(y)
tensor([[-0.5364, -0.1848]], grad_fn=<AddmmBackward>)

As we see, the output is a tensor with its gradient function attached -- Autograd tracks it for us.

Tip: modules override the __call__() method, where the framework does some work. Thus, instead of calling the forward() method directly, apply the model to the input, as in model(x).
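One example of the work done inside __call__() is running registered hooks. The sketch below (the hook function record_call and the list calls are illustrative names) registers a forward hook on a linear layer and shows that it runs when the layer is applied to the input, but not when forward() is called directly:

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
calls = []

# A forward hook is one example of the extra work done inside __call__()
def record_call(module, inputs, output):
    calls.append(tuple(output.size()))

handle = layer.register_forward_hook(record_call)

x = torch.randn(1, 3)
layer(x)            # goes through __call__(): the hook runs
layer.forward(x)    # bypasses __call__(): the hook does not run

print(calls)        # [(1, 2)]
handle.remove()
```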


A powerful class in the nn package is Sequential, which allows us to express the code above more succinctly:

In [32]:
class MyCustomModule(nn.Module):
    def __init__(self, n_inputs, n_hidden, n_output_classes):
        super(MyCustomModule, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_output_classes))
    def forward(self, x):
        y = self.network(x)
        return y

As you can imagine, this can be handy when we have a large number of layers for which the actual names are not that meaningful. It also improves readability:

In [33]:
class MyCustomModule(nn.Module):
    def __init__(self, n_inputs, n_hidden, n_output_classes):
        super(MyCustomModule, self).__init__()
        self.p_keep = 0.7
        self.network = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 2*n_hidden),
            nn.ReLU(),
            nn.Linear(2*n_hidden, n_output_classes),
            nn.Dropout(1 - self.p_keep),       # dropout argument is probability of dropping
            nn.Softmax(dim=1)                  # applies softmax in the data dimension
        )
    def forward(self, x):
        y = self.network(x)
        return y

Side note: Another important package in torch.nn is functional (torch.nn.functional), typically imported as F. It contains many useful functions, from non-linear activations to convolutional, dropout, and even distance functions. Many of these functions have counterpart implementations as layers in the nn package so that they can be easily used in pipelines like the one above implemented using nn.Sequential.

In [34]:
import torch.nn.functional as F

y = F.relu(torch.FloatTensor([[-5, -1, 0, 5]]))
print(y)

tensor([[0., 0., 0., 5.]])

Criteria and Loss Functions

PyTorch has implementations for the most common criteria in the torch.nn package. You may notice that, as with many of the other functions, there are two implementations of loss functions: the reference functions in torch.nn.functional and the practical classes in torch.nn, which are the ones we typically use. Probably the two most common ones are (Lapan 2018):

  • nn.MSELoss (mean squared error): squared $L_2$ norm used for regression.
  • nn.CrossEntropyLoss: criterion used for classification as the result of combining nn.LogSoftmax() and nn.NLLLoss() (negative log likelihood), operating on the input scores directly. When possible, we recommend using this class instead of using a softmax layer plus a log conversion and nn.NLLLoss, given that the LogSoftmax implementation guards against common numerical errors, resulting in fewer instabilities.
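The relationship between these criteria can be checked directly. In the sketch below, with made-up scores for two examples and three classes, nn.CrossEntropyLoss applied to the raw scores gives the same value as nn.NLLLoss applied to the output of nn.LogSoftmax:

```python
import torch
import torch.nn as nn

scores = torch.tensor([[1.0, 2.0, 0.5],
                       [0.1, 0.2, 3.0]])   # raw scores for 2 examples, 3 classes
targets = torch.tensor([1, 2])              # gold class indices

loss_ce = nn.CrossEntropyLoss()(scores, targets)
loss_nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(scores), targets)

print(loss_ce.item(), loss_nll.item())      # the two values agree
```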

Once our model produces a prediction, we pass it to the criterion to obtain a measure of the loss:

In [35]:
y_gold = torch.tensor([1])        # the true label (class index 1, i.e. the second class) from our dataset,
                                  # wrapped as a tensor of minibatch size 1
criterion = nn.CrossEntropyLoss() # our simple classification criterion for this simple example

y = model(x)                      # forward pass of our model (remember, using apply instead of forward) 

loss = criterion(y, y_gold)       # apply the criterion to get the loss corresponding to the pair (x, y)
                                  # with respect to the real y (y_gold)

print(loss)                       # the loss contains a gradient function that we can use to compute
                                  # the gradient dL/dw (gradient with respect to the parameters for a given fixed input)
tensor(0.5327, grad_fn=<NllLossBackward>)


Once we have computed the loss for a training example or minibatch of examples, we update the parameters of the model guided by the information contained in the gradient. The role of updating the parameters belongs to the optimizer, and PyTorch has a number of implementations available right away -- and if you don't find your preferred optimizer as part of the library, chances are that you will find an existing implementation. Also, coding your own optimizer is indeed quite easy in PyTorch.

Side note: The following is a summary of the most common optimizers. It is intended to serve as a reference (I use this table myself quite a lot). In practice, most people pick an optimizer that has been proven to behave well on a given domain, but optimizers are also a very active area of research in Numerical Analysis, so it is a good idea to pay some attention to this subfield. We recommend using second-order dynamics with adaptive time step (following Fedkiew 2019):

  • First-order dynamics

    • Search direction only: optim.SGD
    • Adaptive: optim.RMSprop, optim.Adagrad, optim.Adadelta
  • Second-order dynamics

    • Search direction only: Momentum, optim.SGD(momentum=0.9); Nesterov, optim.SGD(momentum=0.9, nesterov=True)
    • Adaptive: optim.Adam, optim.Adamax (Adam with $L_\infty$)
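To see what a single optimizer update does, here is a minimal sketch using plain optim.SGD on one parameter with a toy quadratic loss (both chosen only for illustration), so the result can be verified by hand:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()    # dL/dw = 2w = 2 at w = 1
optimizer.zero_grad()
loss.backward()
optimizer.step()         # plain SGD: w <- w - lr * grad = 1 - 0.1 * 2

print(w)                 # approximately 0.8
```

The other optimizers in the list follow the same zero_grad / backward / step pattern; only the update rule inside step() changes.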

Training a Simple Model

In order to illustrate the different concepts and techniques above, let's put them together in a very simple example: our objective will be to fit a very simple non-linear function, a sine wave:

$y = a \sin(x + \phi)$

where $a, \phi$ are the given amplitude and phase of the sine function. Our objective is to learn to approximate this function using a feed forward network, that is:

$ \hat{y} = f(x)$

such that the error between $y$ and $\hat{y}$ is minimal according to our criterion. A natural criterion is to minimize the squared distance between the actual value of the sine wave and the value predicted by our function approximator, measured using the $L_2$ norm.

Side note: Although this example is easy, simple variations of this setting can pose a big challenge, and are currently used to illustrate difficult problems in learning, especially in a very active subfield known as meta-learning (e.g. Finn et al. 2017, Schulman and Nichol 2018).

Let's import all the modules that we are going to need:

In [36]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import numpy as np
import matplotlib.pyplot as plt
import math

Early in the code, we define the device that we want to use:

In [37]:
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

Let's fix $a=1$, $\phi=0$ and generate training data in the interval $x \in [0,2\pi)$ using NumPy:

In [38]:
M = 1200

# sample from the x axis M points
x = np.random.rand(M) * 2*math.pi

# add noise
eta = np.random.rand(M) * 0.01

# compute the function
y = np.sin(x) + eta

# plot
plt.scatter(x, y)
<matplotlib.collections.PathCollection at 0x7ffae0379e48>
In [39]:
# use the NumPy-PyTorch bridge
x_train = torch.tensor(x[0:1000]).float().view(-1, 1).to(device)
y_train = torch.tensor(y[0:1000]).float().view(-1, 1).to(device)

x_test = torch.tensor(x[1000:]).float().view(-1, 1).to(device)
y_test = torch.tensor(y[1000:]).float().view(-1, 1).to(device)
In [40]:
class SineDataset(data.Dataset):
    def __init__(self, x, y):
        super(SineDataset, self).__init__()
        assert x.shape[0] == y.shape[0]
        self.x = x
        self.y = y

    def __len__(self):
        return self.y.shape[0]

    def __getitem__(self, index):
        return self.x[index], self.y[index]

sine_dataset = SineDataset(x_train, y_train)
sine_dataset_test = SineDataset(x_test, y_test)

sine_loader = data.DataLoader(sine_dataset, batch_size=32, shuffle=True)
sine_loader_test = data.DataLoader(sine_dataset_test, batch_size=32)
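To get a feeling for what a data loader yields, the sketch below builds a loader over 100 synthetic points and inspects the mini-batch sizes. It uses the built-in data.TensorDataset as a stand-in for a custom dataset class; the data and sizes are assumptions for illustration.

```python
import torch
import torch.utils.data as data

# 100 synthetic (x, y) pairs; TensorDataset is a built-in equivalent of the
# custom dataset class for tensors that are already in memory
x = torch.linspace(0, 1, 100).view(-1, 1)
y = torch.sin(x)
loader = data.DataLoader(data.TensorDataset(x, y), batch_size=32, shuffle=True)

sizes = [x_i.size(0) for x_i, y_i in loader]
print(len(loader))   # 4 batches: ceil(100 / 32)
print(sizes)         # [32, 32, 32, 4]: the last batch holds the remainder
```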
In [41]:
class SineModel(nn.Module):
    def __init__(self):
        super(SineModel, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(1, 5),
            nn.ReLU(),
            nn.Linear(5, 5),
            nn.ReLU(),
            nn.Linear(5, 5),
            nn.ReLU(),
            nn.Linear(5, 1))
    def forward(self, x):
        return self.network(x)
In [42]:
# declare the model
model = SineModel().to(device)

# define the criterion
criterion = nn.MSELoss()

# select the optimizer and pass to it the parameters of the model it will optimize
optimizer = torch.optim.Adam(model.parameters(), lr = 0.01)

epochs = 1000

# training loop
for epoch in range(epochs):
    for i, (x_i, y_i) in enumerate(sine_loader):

        y_hat_i = model(x_i)            # forward pass
        loss = criterion(y_hat_i, y_i)  # compute the loss

        optimizer.zero_grad()           # cleans the gradients
        loss.backward()                 # computes the gradients
        optimizer.step()                # update the parameters

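    if epoch % 20 == 0:
        # progress report every 20 epochs (the body of this branch is missing
        # in the source; printing the last mini-batch loss is an assumption)
        print("Epoch {}: loss {:.4f}".format(epoch, loss.item()))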
In [43]:
# testing
with torch.no_grad():
    total_loss = 0.
    for k, (x_k, y_k) in enumerate(sine_loader_test):
        y_hat_k = model(x_k)
        loss_test = criterion(y_hat_k, y_k)
        total_loss += float(loss_test)



In [44]:
def enforce_reproducibility(seed=42):
    # Sets seed manually for both CPU and CUDA
    torch.manual_seed(seed)
    # For atomic operations there is currently 
    # no simple way to enforce determinism, as
    # the order of parallel operations is not known.
    # CUDNN
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # System based
    np.random.seed(seed)

References and Further Reading

Lapan, Maxim (2018) Deep Reinforcement Learning Hands-On. Birmingham: Packt Publishing

Rao, Delip and Brian McMahan (2019) Natural Language Processing with PyTorch. Sebastopol, CA: O'Reilly Media