• Derivatives
The derivative on each variable tells you the sensitivity of the whole expression to its value.
In other words, nudging a variable by a small amount h changes the function value by approximately the derivative times h.

The derivatives tell us nothing about the effect of large changes in the inputs of a function.
They are only informative for tiny, infinitesimally small changes in the inputs, as indicated by the $\lim_{h \to 0}$ in the definition of the derivative.
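A minimal sketch of this point, assuming a hypothetical function f(x) = x**3 that is not part of the notes: the first-order prediction (derivative times step) is accurate for a tiny step and badly off for a large one.

```python
def f(x):
    # hypothetical example function, chosen only for illustration
    return x ** 3

def numerical_derivative(func, x, h=1e-5):
    # centered difference approximation of the limit definition
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 2.0
slope = numerical_derivative(f, x0)  # the analytic value is 3 * x0**2 = 12

def linear_prediction_error(step):
    # gap between the true change in f and the first-order prediction slope * step
    true_change = f(x0 + step) - f(x0)
    return abs(true_change - slope * step)

small_error = linear_prediction_error(1e-3)  # tiny step: prediction is close
large_error = linear_prediction_error(5.0)   # large step: prediction is far off
print(slope, small_error, large_error)
```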

• Unintuitive effects and their consequences
Notice that if one of the inputs to the multiply gate is very small and the other is very big, then the multiply gate will do something slightly unintuitive: it will assign a relatively huge gradient to the small input and a tiny gradient to the large input.
Note that in linear classifiers, where the weights are multiplied with the inputs in the dot product $w^T x_i$, this implies that the scale of the data has an effect on the magnitude of the gradient for the weights.
For example, if you multiplied all input data examples $x_i$ by 1000 during preprocessing, then the gradient on the weights would be 1000 times larger, and you'd have to lower the learning rate by that factor to compensate. This is why preprocessing matters a lot, sometimes in subtle ways! Having an intuitive understanding of how the gradients flow can help you debug some of these cases.
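Both effects can be seen in a few lines. This is only a sketch: the weights, the data example, and the upstream gradient `dscore` are made-up values, and the upstream gradient is held fixed when the data is rescaled.

```python
import numpy as np

# multiply gate f = a * b: the gradient on each input is the value of the other input
a, b = 0.001, 1000.0
da, db = b, a  # upstream gradient taken as 1.0

# linear score s = w . x_i: the gradient on w is x_i times the upstream gradient
w = np.array([0.5, -1.0, 2.0])    # hypothetical weights
x_i = np.array([1.0, 2.0, 3.0])   # hypothetical data example
dscore = 1.0                      # assumed upstream gradient, held fixed

dw = x_i * dscore
dw_scaled = (1000.0 * x_i) * dscore  # same example after multiplying the data by 1000

ratio = dw_scaled / dw
print(da, db, ratio)
```

The small input `a` receives the large gradient and the large input `b` the small one, and every entry of `ratio` equals the scale factor applied to the data.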
In [1]:
import math
import numpy as np

In [5]:
w = [2, -3, -3] # random weights and data
x = [-1, -2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0 / (1+math.exp(-dot)) # sigmoid

# backward pass through the neuron
ddot = (1 - f) * f # gradient on dot: local derivative of the sigmoid
dx = [w[0] * ddot, w[1] * ddot] # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w

In [6]:
dx

Out[6]:
[0.3932238664829637, -0.5898357997244456]
In [7]:
dw

Out[7]:
[-0.19661193324148185, -0.3932238664829637, 0.19661193324148185]
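The analytic gradients above can be sanity-checked against a centered finite difference. A sketch, in which `neuron` and `numerical_grad` are helper names introduced here for illustration:

```python
import math

def neuron(w, x):
    # same forward pass as the cell above: sigmoid of w0*x0 + w1*x1 + w2
    dot = w[0] * x[0] + w[1] * x[1] + w[2]
    return 1.0 / (1 + math.exp(-dot))

def numerical_grad(func, values, h=1e-5):
    # centered difference with respect to each entry of `values`
    grads = []
    for i in range(len(values)):
        plus = list(values)
        minus = list(values)
        plus[i] += h
        minus[i] -= h
        grads.append((func(plus) - func(minus)) / (2 * h))
    return grads

w = [2, -3, -3]
x = [-1, -2]

# analytic gradients, as in the backward pass above
f = neuron(w, x)
ddot = (1 - f) * f
dx = [w[0] * ddot, w[1] * ddot]
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]

# numerical gradients for comparison
dx_num = numerical_grad(lambda xs: neuron(w, xs), x)
dw_num = numerical_grad(lambda ws: neuron(ws, x), w)
max_err = max(abs(a - n) for a, n in zip(dx + dw, dx_num + dw_num))
print(max_err)
```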
In [8]:
x = 3
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y))
num = x + sigy
sigx = 1.0 / (1 + math.exp(-x))
xpy = x + y
xpysqr = xpy**2
den = sigx + xpysqr
invden = 1.0 / den
f = num * invden

In [10]:
# backprop f = num * invden
dnum = invden
dinvden = num
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden
# backprop den = sigx + xpysqr
dsigx = 1 * dden
dxpysqr = 1 * dden
# backprop xpysqr = xpy ** 2
dxpy = (2 * xpy) * dxpysqr
# backprop xpy = x + y
dx = 1 * dxpy
dy = 1 * dxpy
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx
# backprop num = x + sigy
dx += 1 * dnum
dsigy = 1 * dnum
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy
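A staged backward pass like this one is easy to get subtly wrong, so a numerical check is worth having. The sketch below repeats the computation inside a hypothetical helper `analytic_grads` and compares it with centered differences. The helper names are introduced here and are not part of the original cells.

```python
import math

def sigmoid(z):
    return 1.0 / (1 + math.exp(-z))

def f(x, y):
    # the function computed by the staged forward pass above
    return (x + sigmoid(y)) / (sigmoid(x) + (x + y) ** 2)

def analytic_grads(x, y):
    # forward pass, caching intermediates
    sigy = sigmoid(y)
    num = x + sigy
    sigx = sigmoid(x)
    xpy = x + y
    den = sigx + xpy ** 2
    invden = 1.0 / den
    # backward pass, in reverse order of the forward pass
    dnum = invden
    dden = (-1.0 / den ** 2) * num
    dxpy = (2 * xpy) * dden
    dx = dxpy + (1 - sigx) * sigx * dden + dnum
    dy = dxpy + (1 - sigy) * sigy * dnum
    return dx, dy

x, y, h = 3.0, -4.0, 1e-5
dx, dy = analytic_grads(x, y)
dx_num = (f(x + h, y) - f(x - h, y)) / (2 * h)
dy_num = (f(x, y + h) - f(x, y - h)) / (2 * h)
print(abs(dx - dx_num), abs(dy - dy_num))
```

Note that `dx` and `dy` each collect contributions from more than one branch of the circuit, which is why the cell above accumulates them with `+=`.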

In [2]:
# forward pass
W = np.random.rand(5, 10)
X = np.random.rand(10, 3)
D = W.dot(X)

# now suppose we had the gradient on D from above in the circuit
dD = np.random.rand(*D.shape)
dW = dD.dot(X.T)
dX = W.T.dot(dD)

In [4]:
D.shape

Out[4]:
(5, 3)
In [5]:
W.shape

Out[5]:
(5, 10)
In [6]:
X.shape

Out[6]:
(10, 3)
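The shapes above are a handy way to remember the matrix gradient expressions: `dW` must have the shape of `W` and `dX` the shape of `X`, and only one arrangement of `dD`, `X` and `W` makes the dimensions work out. The sketch below checks both the shapes and a couple of entries numerically. The scalar `loss` is a hypothetical stand-in chosen so that its gradient with respect to `D` is exactly `dD`, and the seed is there only for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
W = rng.random((5, 10))
X = rng.random((10, 3))
dD = rng.random((5, 3))  # stand-in for the gradient arriving from above

def loss(W_, X_):
    # hypothetical scalar loss chosen so that dL/dD equals dD exactly
    return np.sum(W_.dot(X_) * dD)

dW = dD.dot(X.T)  # (5, 3) x (3, 10) -> (5, 10), same shape as W
dX = W.T.dot(dD)  # (10, 5) x (5, 3) -> (10, 3), same shape as X

def numerical_entry(wrt, i, j, h=1e-5):
    # centered difference on a single entry of W or X
    W_plus, W_minus = W.copy(), W.copy()
    X_plus, X_minus = X.copy(), X.copy()
    if wrt == "W":
        W_plus[i, j] += h
        W_minus[i, j] -= h
    else:
        X_plus[i, j] += h
        X_minus[i, j] -= h
    return (loss(W_plus, X_plus) - loss(W_minus, X_minus)) / (2 * h)

err_W = abs(numerical_entry("W", 1, 2) - dW[1, 2])
err_X = abs(numerical_entry("X", 4, 0) - dX[4, 0])
print(dW.shape, dX.shape, err_W, err_X)
```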