Deep Learning Models -- A collection of various deep learning architectures, models, and tips for TensorFlow and PyTorch in Jupyter Notebooks.
%load_ext watermark
%watermark -a 'Sebastian Raschka' -v -p torch
Sebastian Raschka CPython 3.6.8 IPython 7.2.0 torch 1.0.0
This notebook illustrates how we can fetch the intermediate gradients of a function that takes multiple inputs and is composed of multiple computation steps in PyTorch. Note that a gradient is simply a vector listing the derivatives of a function with respect to each of its arguments. So, strictly speaking, we are discussing how to obtain the partial derivatives here.
Assume we have this simple toy graph:
Now, we provide the following values to b, x, and w; the red numbers indicate the intermediate values of the computation and the end result:
Now, the next image shows the partial derivatives of the output node, a, with respect to the input nodes (b, x, and w) as well as all the intermediate partial derivatives:
(The images were taken from my PyData Talk in August 2017. For more information on how to arrive at these derivatives, please see the talk/slides at https://github.com/rasbt/pydata-annarbor2017-dl-tutorial; I also put up a little calculus and differentiation primer, in case it is helpful: https://sebastianraschka.com/pdf/books/dlb/appendix_d_calculus.pdf)
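As a quick sanity check that needs no deep learning library, the partial derivatives can be worked out by hand with the chain rule. The sketch below assumes the graph used in the code cells of this notebook (u = x * w, v = u + b, a = relu(v)) with x = 3, w = 2, and b = 1. The helper names `relu` and `relu_grad` are introduced here for illustration only, and treating the ReLU derivative at exactly 0 as 0 is an assumption of this sketch.

```python
def relu(z):
    # ReLU activation: max(0, z)
    return max(0.0, z)


def relu_grad(z):
    # Derivative of ReLU with respect to its input.
    # Assumption of this sketch: the derivative at z == 0 is taken to be 0.
    return 1.0 if z > 0 else 0.0


# Forward pass with the input values used in this notebook
x, w, b = 3.0, 2.0, 1.0
u = x * w      # 6.0
v = u + b      # 7.0
a = relu(v)    # 7.0

# Backward pass: apply the chain rule from the output toward the inputs
d_a_v = relu_grad(v)   # da/dv
d_a_u = d_a_v * 1.0    # dv/du = 1
d_a_b = d_a_v * 1.0    # dv/db = 1
d_a_x = d_a_u * w      # du/dx = w
d_a_w = d_a_u * x      # du/dw = x

for name, value in zip("xwbuv", (d_a_x, d_a_w, d_a_b, d_a_u, d_a_v)):
    print('d_a_%s:' % name, value)
```

These hand-computed values are the ones the TensorFlow and PyTorch snippets in this notebook are expected to reproduce.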
For instance, if we are interested in obtaining the partial derivative of the output a with respect to each of the input and intermediate nodes, we could do the following in TensorFlow, where d_a_b denotes "partial derivative of a with respect to b" and so forth:
# Note: this cell uses the TensorFlow 1.x graph-mode API
import tensorflow as tf

g = tf.Graph()
with g.as_default() as g:

    # Inputs and parameters of the toy graph
    x = tf.placeholder(dtype=tf.float32, shape=None, name='x')
    w = tf.Variable(initial_value=2, dtype=tf.float32, name='w')
    b = tf.Variable(initial_value=1, dtype=tf.float32, name='b')

    # Forward computation
    u = x * w
    v = u + b
    a = tf.nn.relu(v)

    # Partial derivatives of a with respect to each node
    d_a_x = tf.gradients(a, x)
    d_a_w = tf.gradients(a, w)
    d_a_b = tf.gradients(a, b)
    d_a_u = tf.gradients(a, u)
    d_a_v = tf.gradients(a, v)

with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer())
    grads = sess.run([d_a_x, d_a_w, d_a_b, d_a_u, d_a_v], feed_dict={'x:0': 3})

print(grads)
[[2.0], [3.0], [1.0], [1.0], [1.0]]
grad
In PyTorch, there are multiple ways to compute partial derivatives or gradients. If the goal is just to compute partial derivatives, the most straightforward way is to use autograd's grad function. By default, the retain_graph parameter of the grad function is set to False, which frees the graph after the partial derivative has been computed. Thus, if we want to obtain multiple partial derivatives through separate grad calls, we need to set retain_graph=True. Note that this is a very inefficient solution, though, because each call makes its own pass over the graph and recalculates intermediate results:
import torch
import torch.nn.functional as F
from torch.autograd import grad
x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
u = x * w
v = u + b
a = F.relu(v)
d_a_b = grad(a, b, retain_graph=True)
d_a_u = grad(a, u, retain_graph=True)
d_a_v = grad(a, v, retain_graph=True)
d_a_w = grad(a, w, retain_graph=True)
d_a_x = grad(a, x)
# Use a loop variable name that does not shadow the imported grad function
for name, deriv in zip("xwbuv", (d_a_x, d_a_w, d_a_b, d_a_u, d_a_v)):
    print('d_a_%s:' % name, deriv)
d_a_x: (tensor([2.]),)
d_a_w: (tensor([3.]),)
d_a_b: (tensor([1.]),)
d_a_u: (tensor([1.]),)
d_a_v: (tensor([1.]),)
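To see why retain_graph=True matters when grad is called more than once, the following minimal sketch calls grad twice on the same graph while leaving retain_graph at its default. The variable name `second_call_failed` is introduced here purely for illustration, and the exact wording of the error message may differ between PyTorch versions.

```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

a = F.relu(x * w + b)

# First call: retain_graph is left at its default, so the buffers
# saved for the backward pass are freed once this call returns
first = grad(a, x)

# Second call on the same graph: the saved buffers are gone
try:
    grad(a, w)
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print('Second grad call failed:', err)
```

Passing retain_graph=True to every call except the last one, as in the example above, avoids this error at the cost of traversing the graph once per call.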
As suggested by Adam Paszke, the repeated grad calls shown earlier can be rewritten in a more efficient manner by passing a tuple to the grad function, so that it can reuse intermediate results and requires only one pass over the graph:
import torch
import torch.nn.functional as F
from torch.autograd import grad
x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
u = x * w
v = u + b
a = F.relu(v)
partial_derivatives = grad(a, (x, w, b, u, v))
# Use a loop variable name that does not shadow the imported grad function
for name, deriv in zip("xwbuv", partial_derivatives):
    print('d_a_%s:' % name, deriv)
d_a_x: tensor([2.])
d_a_w: tensor([3.])
d_a_b: tensor([1.])
d_a_u: tensor([1.])
d_a_v: tensor([1.])
retain_grad
In PyTorch, we most often use the backward() method on an output variable to compute its partial derivatives (or gradient) with respect to its inputs (typically, the weights and bias units of a neural network). To save memory, PyTorch by default only stores the gradients of the leaf variables (e.g., the weights and biases) in their grad attribute. So, if we are interested in the gradients of intermediate results in a computational graph, we can call the retain_grad method on the non-leaf variables before calling backward(), as follows:
import torch
import torch.nn.functional as F
x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
u = x * w
v = u + b
a = F.relu(v)
u.retain_grad()
v.retain_grad()
a.backward()
for name, var in zip("xwbuv", (x, w, b, u, v)):
    print('d_a_%s:' % name, var.grad)
d_a_x: tensor([2.])
d_a_w: tensor([3.])
d_a_b: tensor([1.])
d_a_u: tensor([1.])
d_a_v: tensor([1.])
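For contrast, here is a small sketch of what happens when retain_grad is called on only one of the intermediate nodes. In this variation, v retains its gradient but u does not, so u.grad stays None after backward(). Recent PyTorch versions emit a warning when the grad attribute of a non-leaf tensor is read; the warnings filter below is only there to keep the output clean and is not required.

```python
import warnings

import torch
import torch.nn.functional as F

x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

u = x * w
v = u + b
a = F.relu(v)

# Retain the gradient for v only; u is deliberately left out
v.retain_grad()

a.backward()

# Reading .grad of a non-leaf tensor may trigger a warning
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    u_grad = u.grad

print('x is leaf:', x.is_leaf, '| u is leaf:', u.is_leaf)
print('d_a_x:', x.grad)   # leaf tensor: stored by default
print('d_a_v:', v.grad)   # non-leaf tensor with retain_grad(): stored
print('d_a_u:', u_grad)   # non-leaf tensor without retain_grad(): None
```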
Finally, as a workaround that is not recommended, we can use hooks to obtain intermediate gradients. While the two approaches explained above should be preferred, this approach highlights the use of hooks, which may come in handy in certain situations.
The hook will be called every time a gradient with respect to the variable is computed. (http://pytorch.org/docs/master/autograd.html#torch.autograd.Variable.register_hook)
Based on the suggestion by Adam Paszke (https://discuss.pytorch.org/t/why-cant-i-see-grad-of-an-intermediate-variable/94/7?u=rasbt), we can use these hooks in combination with a little helper function, save_grad, which returns a hook closure that writes the partial derivatives or gradients to a global variable, grads. So, if we invoke the backward method on the output node a, all the intermediate results will be collected in grads, as illustrated below:
import torch
import torch.nn.functional as F
grads = {}
def save_grad(name):
    def hook(grad):
        grads[name] = grad
    return hook
x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
u = x * w
v = u + b
x.register_hook(save_grad('d_a_x'))
w.register_hook(save_grad('d_a_w'))
b.register_hook(save_grad('d_a_b'))
u.register_hook(save_grad('d_a_u'))
v.register_hook(save_grad('d_a_v'))
a = F.relu(v)
a.backward()
grads
{'d_a_v': tensor([1.]), 'd_a_b': tensor([1.]), 'd_a_u': tensor([1.]), 'd_a_x': tensor([2.]), 'd_a_w': tensor([3.])}
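If the global dictionary is a concern, the same idea can be wrapped in a function that owns its storage and hands back the hook handles so the hooks can be removed afterwards. This is only a sketch of one possible variation: the helper `attach_grad_hooks` and the names `store` and `handles` are hypothetical and are not part of PyTorch or of the original suggestion. It relies on register_hook returning a handle object with a remove() method.

```python
import torch
import torch.nn.functional as F


def attach_grad_hooks(named_tensors):
    """Register a gradient-saving hook on each tensor (hypothetical helper)."""
    store, handles = {}, []
    for name, tensor in named_tensors.items():
        # Bind the current name via a default argument so that each hook
        # writes to its own key
        def hook(grad, name=name):
            # Returning None leaves the gradient itself unchanged
            store[name] = grad
        handles.append(tensor.register_hook(hook))
    return store, handles


x = torch.tensor([3.], requires_grad=True)
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

u = x * w
v = u + b
a = F.relu(v)

store, handles = attach_grad_hooks(
    {'d_a_x': x, 'd_a_w': w, 'd_a_b': b, 'd_a_u': u, 'd_a_v': v})

a.backward()

# Remove the hooks once the gradients have been collected
for handle in handles:
    handle.remove()

for name in sorted(store):
    print(name, store[name])
```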
%watermark -iv
tensorflow 1.12.0 torch 1.0.0