Deep Learning Models -- A collection of various deep learning architectures, models, and tips for TensorFlow and PyTorch in Jupyter Notebooks.

In [1]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -v -p torch
Sebastian Raschka 

CPython 3.6.8
IPython 7.2.0

torch 1.0.0

Model Zoo -- Getting Gradients of an Intermediate Variable in PyTorch

This notebook illustrates how we can fetch the intermediate gradients of a function that takes multiple inputs and is composed of multiple computation steps in PyTorch. Note that a gradient is simply a vector listing the partial derivatives of a function with respect to each of its arguments. So, strictly speaking, we are discussing how to obtain partial derivatives here.

Assume we have this simple toy graph, in which u = x * w, v = u + b, and a = ReLU(v):

Now, we provide the values x = 3, w = 2, and b = 1; the red numbers in the figure indicate the intermediate values of the computation (u = 6 and v = 7) and the end result (a = 7):

Now, the next image shows the partial derivatives of the output node, a, with respect to the input nodes (b, x, and w) as well as all the intermediate partial derivatives:

(The images were taken from my PyData Talk in August 2017. For more information on how to arrive at these derivatives, please see the talk/slides at https://github.com/rasbt/pydata-annarbor2017-dl-tutorial; I also put up a little calculus and differentiation primer, in case it is helpful: https://sebastianraschka.com/pdf/books/dlb/appendix_d_calculus.pdf)
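As a rough cross-check of the partial derivatives described above, the chain rule can be applied by hand in plain Python and compared against central finite differences. This is only a sketch: the helper names (forward, finite_difference) and the step size eps are choices made for this illustration and do not appear in the talk or slides.

```python
def relu(z):
    return max(0.0, z)


def forward(x, w, b):
    """Forward pass of the toy graph; returns (u, v, a)."""
    u = x * w
    v = u + b
    a = relu(v)
    return u, v, a


x, w, b = 3.0, 2.0, 1.0
u, v, a = forward(x, w, b)

# Backward pass, applying the chain rule from the output to the inputs
d_a_v = 1.0 if v > 0 else 0.0  # derivative of ReLU at v
d_a_u = d_a_v * 1.0            # v = u + b, so dv/du = 1
d_a_b = d_a_v * 1.0            # v = u + b, so dv/db = 1
d_a_w = d_a_u * x              # u = x * w, so du/dw = x
d_a_x = d_a_u * w              # u = x * w, so du/dx = w


def finite_difference(f, z, eps=1e-6):
    """Central-difference approximation of df/dz (illustrative helper)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


num_x = finite_difference(lambda z: forward(z, w, b)[2], x)
num_w = finite_difference(lambda z: forward(x, z, b)[2], w)
num_b = finite_difference(lambda z: forward(x, w, z)[2], b)

print(d_a_x, d_a_w, d_a_b, d_a_u, d_a_v)  # 2.0 3.0 1.0 1.0 1.0
```

The numerical estimates num_x, num_w, and num_b should agree with the analytical values up to floating-point error.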

For instance, if we are interested in obtaining the partial derivative of the output a with respect to each of the input and intermediate nodes, we could do the following in TensorFlow, where d_a_b denotes "partial derivative of a with respect to b" and so forth:

In [2]:
import tensorflow as tf

g = tf.Graph()
with g.as_default() as g:
    
    x = tf.placeholder(dtype=tf.float32, shape=None, name='x')
    w = tf.Variable(initial_value=2, dtype=tf.float32, name='w')
    b = tf.Variable(initial_value=1, dtype=tf.float32, name='b')
    
    u = x * w
    v = u + b
    a = tf.nn.relu(v)
    
    d_a_x = tf.gradients(a, x)
    d_a_w = tf.gradients(a, w)
    d_a_b = tf.gradients(a, b)
    d_a_u = tf.gradients(a, u)
    d_a_v = tf.gradients(a, v)


with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer())
    grads = sess.run([d_a_x, d_a_w, d_a_b, d_a_u, d_a_v], feed_dict={'x:0': 3})

print(grads)
[[2.0], [3.0], [1.0], [1.0], [1.0]]
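The listing above uses the TensorFlow 1.x graph-and-session API (tf.placeholder, tf.Session). Assuming TensorFlow 2.x with eager execution instead (an assumption; this notebook was run with 1.12.0), a roughly equivalent sketch could use tf.GradientTape. Since x is a constant here rather than a variable, it is watched explicitly:

```python
import tensorflow as tf

x = tf.constant(3.0)
w = tf.Variable(2.0, name='w')
b = tf.Variable(1.0, name='b')

with tf.GradientTape() as tape:
    tape.watch(x)  # constants are not tracked automatically
    u = x * w
    v = u + b
    a = tf.nn.relu(v)

# A single call can return gradients for inputs and intermediate tensors
grads = tape.gradient(a, [x, w, b, u, v])

print([float(g) for g in grads])
```

The printed values should match the TensorFlow 1.x result above, in the order x, w, b, u, v.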

Intermediate Gradients in PyTorch via autograd's grad

In PyTorch, there are multiple ways to compute partial derivatives or gradients. If the goal is just to compute partial derivatives, the most straightforward way is to use autograd's grad function. By default, the retain_graph parameter of the grad function is set to False, which frees the graph after computing the partial derivative. Thus, if we want to obtain multiple partial derivatives through separate calls, we need to set retain_graph=True on every call except the last. Note that this is an inefficient solution, though, because each call makes its own pass over the graph and recalculates intermediate results:

In [3]:
import torch
import torch.nn.functional as F
from torch.autograd import grad


x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

u = x * w
v = u + b
a = F.relu(v)

d_a_b = grad(a, b, retain_graph=True)
d_a_u = grad(a, u, retain_graph=True)
d_a_v = grad(a, v, retain_graph=True)
d_a_w = grad(a, w, retain_graph=True)
d_a_x = grad(a, x)
    

# Use a loop variable that does not shadow the imported grad function
for name, partial in zip("xwbuv", (d_a_x, d_a_w, d_a_b, d_a_u, d_a_v)):
    print('d_a_%s:' % name, partial)
d_a_x: (tensor([2.]),)
d_a_w: (tensor([3.]),)
d_a_b: (tensor([1.]),)
d_a_u: (tensor([1.]),)
d_a_v: (tensor([1.]),)

As suggested by Adam Paszke, this can be rewritten in a more efficient manner by passing a tuple to the grad function so that it can reuse intermediate results and requires only one pass over the graph:

In [4]:
import torch
import torch.nn.functional as F
from torch.autograd import grad


x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

u = x * w
v = u + b
a = F.relu(v)

partial_derivatives = grad(a, (x, w, b, u, v))

# Use a loop variable that does not shadow the imported grad function
for name, partial in zip("xwbuv", partial_derivatives):
    print('d_a_%s:' % name, partial)
d_a_x: tensor([2.])
d_a_w: tensor([3.])
d_a_b: tensor([1.])
d_a_u: tensor([1.])
d_a_v: tensor([1.])
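To see why retain_graph=True was needed in the first version, the following sketch calls grad twice on the same graph while leaving retain_graph at its default. The second call is expected to fail because the buffers saved for the backward pass have already been freed. The variable names first and second_call_failed are introduced only for this illustration, and the exact error message may differ between PyTorch versions:

```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

a = F.relu(x * w + b)

# First call: retain_graph is left at its default, so the graph is freed
first = grad(a, x)

try:
    grad(a, x)  # second pass over the already-freed graph
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print('RuntimeError:', err)
```

Passing all targets in one tuple, as in the cell above, avoids the problem entirely because only a single backward pass is made.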

Intermediate Gradients in PyTorch via retain_grad

In PyTorch, we most often use the backward() method on an output variable to compute its partial derivative (or gradient) with respect to its inputs (typically, the weights and bias units of a neural network). By default, PyTorch only stores the gradients of the leaf variables (e.g., the weights and biases) via their grad attribute to save memory. So, if we are interested in the gradients of intermediate results in a computational graph, we can call the retain_grad method on the non-leaf variables before calling backward() so that their gradients are stored as well:

In [5]:
import torch
import torch.nn.functional as F


x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

u = x * w
v = u + b
a = F.relu(v)

# Ask autograd to keep .grad for the non-leaf tensors u and v
u.retain_grad()
v.retain_grad()

a.backward()

for name, var in zip("xwbuv", (x, w, b, u, v)):
    print('d_a_%s:' % name, var.grad)
d_a_x: tensor([2.])
d_a_w: tensor([3.])
d_a_b: tensor([1.])
d_a_u: tensor([1.])
d_a_v: tensor([1.])
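For contrast, a minimal sketch of what happens when retain_grad is not called: the leaf tensor x receives a gradient, whereas the grad attribute of the non-leaf tensor u stays None. Newer PyTorch releases emit a warning when the grad attribute of a non-leaf tensor is read, so the access is wrapped in a warnings filter here; the name u_grad is used only for this illustration:

```python
import warnings

import torch
import torch.nn.functional as F

x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

u = x * w
v = u + b
a = F.relu(v)

a.backward()  # no retain_grad() calls this time

print('x is leaf:', x.is_leaf)  # created by the user
print('u is leaf:', u.is_leaf)  # result of an operation

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the non-leaf .grad warning
    u_grad = u.grad

print('x.grad:', x.grad)
print('u.grad:', u_grad)
```

The is_leaf attribute is a quick way to check which tensors will have their gradients stored by default.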

Intermediate Gradients in PyTorch Using Hooks

Finally, although this is not a recommended workaround, we can use hooks to obtain intermediate gradients. While the approaches explained above should be preferred, this approach highlights the use of hooks, which may come in handy in certain situations.

The hook will be called every time a gradient with respect to the variable is computed. (http://pytorch.org/docs/master/autograd.html#torch.autograd.Variable.register_hook)

Based on the suggestion by Adam Paszke (https://discuss.pytorch.org/t/why-cant-i-see-grad-of-an-intermediate-variable/94/7?u=rasbt), we can use these hooks in combination with a little helper function, save_grad, which returns a hook closure that writes the partial derivatives or gradients to a global variable, grads. So, if we invoke the backward method on the output node a, all the intermediate results will be collected in grads, as illustrated below:

In [6]:
import torch
import torch.nn.functional as F


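# Maps a name such as 'd_a_u' to the gradient captured during backward()
grads = {}
def save_grad(name):
    # Return a closure that stores the incoming gradient under the given name
    def hook(grad):
        grads[name] = grad
    return hook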


x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

u = x * w
v = u + b

x.register_hook(save_grad('d_a_x'))
w.register_hook(save_grad('d_a_w'))
b.register_hook(save_grad('d_a_b'))
u.register_hook(save_grad('d_a_u'))
v.register_hook(save_grad('d_a_v'))

a = F.relu(v)

a.backward()

grads
Out[6]:
{'d_a_v': tensor([1.]),
 'd_a_b': tensor([1.]),
 'd_a_u': tensor([1.]),
 'd_a_x': tensor([2.]),
 'd_a_w': tensor([3.])}
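One detail that may be useful in practice: register_hook returns a handle, and calling remove() on that handle detaches the hook again. The sketch below registers a hook on u, runs one backward pass, removes the hook, and runs a second backward pass to show that the hook no longer fires. The list captured and the lambda hook are names made up for this illustration:

```python
import torch
import torch.nn.functional as F

captured = []  # gradients seen by the hook (illustrative container)

x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

u = x * w
# The hook returns None (the result of append), so the gradient is unchanged
handle = u.register_hook(lambda g: captured.append(g.clone()))

v = u + b
a = F.relu(v)

a.backward(retain_graph=True)  # hook fires once
handle.remove()                # detach the hook
a.backward()                   # hook does not fire again

print(len(captured), captured[0])
```

Note that the second backward() call accumulates into the grad attributes of the leaf tensors, which is unrelated to the hook itself.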
In [7]:
%watermark -iv
tensorflow  1.12.0
torch       1.0.0