#!/usr/bin/env python
# coding: utf-8

# # 3.14 Forward Propagation, Back Propagation and Computational Graphs
# - In the previous sections...
#   - we calculated the forward propagation of the model
#     - i.e., we calculated the model output for the given input
#   - Then, we executed the auto-generated `backward` function to calculate the gradient
#     - based on back-propagation
# - In this section...
#   - we will use both mathematics and computational graphs to describe ***forward and back propagation***

# ## 3.14.1 Forward Propagation
# - Forward Propagation
#   - It refers to ***the calculation and storage of intermediate variables*** (including outputs) of the neural network in the order from the input layer to the output layer.
# - Let's assume that the input example is $\mathbf{x}\in \mathbb{R}^d$ and there is no bias term.
#   - Here the intermediate variable is
#
# $$\mathbf{z}= \mathbf{W}^{(1)} \mathbf{x},$$
#
#   where $\mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$ is the weight parameter of the hidden layer.
# - Feeding the intermediate variable $\mathbf{z}\in \mathbb{R}^h$ through the element-wise activation function $\phi$, we obtain the hidden variable
#
# $$\mathbf{h}= \phi (\mathbf{z}).$$
#
# - The hidden variable $\mathbf{h}$ is also an intermediate variable.
# - Assuming the output layer has only a weight parameter $\mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$, we obtain an output layer variable of length $q$,
#
# $$\mathbf{o}= \mathbf{W}^{(2)} \mathbf{h}.$$
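# - Before adding the loss and regularization terms below, the next cell gives a minimal sketch of this forward pass.
#   - The concrete sizes $d=5$, $h=4$, $q=3$ and the choice of ReLU for $\phi$ are illustrative assumptions, not part of the text above.

# In[ ]:


from mxnet import nd

d, num_hidden, q = 5, 4, 3                      # assumed sizes for illustration
x = nd.random.normal(shape=(d,))                # input example, no bias term
W1 = nd.random.normal(shape=(num_hidden, d))    # hidden-layer weights W^(1)
W2 = nd.random.normal(shape=(q, num_hidden))    # output-layer weights W^(2)

z = nd.dot(W1, x)    # intermediate variable z = W^(1) x
h = nd.relu(z)       # hidden variable h = phi(z), with phi = ReLU here
o = nd.dot(W2, h)    # output-layer variable o = W^(2) h, length q
print(o)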
#
# - Assuming the loss function is $l$ and the example label is $y$, we can then calculate the loss term for a single data example,
#
# $$L = l(\mathbf{o}, y).$$
# - According to the definition of $\ell_2$ norm regularization, given the hyper-parameter $\lambda$, the regularization term is
#
# $$s = \frac{\lambda}{2} \left(\|\mathbf{W}^{(1)}\|_F^2 + \|\mathbf{W}^{(2)}\|_F^2\right),$$
#
#   where the Frobenius norm of a matrix is equivalent to the $L_2$ norm computed after flattening the matrix into a vector.
#   - https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm
# - The model's regularized loss on a given data example is
#
# $$J = L + s.$$
#
# - We refer to $J$ as the ***objective function*** of a given data example.

# ## 3.14.2 Computational Graph of Forward Propagation
# - Plotting computational graphs helps us visualize the dependencies of operators and variables within the calculation.
#
# ![](https://github.com/d2l-ai/d2l-en/raw/master/img/forward.svg?sanitize=true)

# ## 3.14.3 Back Propagation
# - Back propagation refers to the method of ***calculating the gradients of the neural network parameters*** in the order from the output layer to the input layer, according to the chain rule from calculus.
# - Chain rule
#   - Assume that we have functions $\mathsf{Y}=f(\mathsf{X})$ and $\mathsf{Z}=g(\mathsf{Y}) = g \circ f(\mathsf{X})$, in which the input and the output $\mathsf{X}, \mathsf{Y}, \mathsf{Z}$ are tensors of arbitrary shapes.
#   - By using the chain rule we can compute the derivative of $\mathsf{Z}$ with respect to $\mathsf{X}$ via
#
# $$\frac{\partial \mathsf{Z}}{\partial \mathsf{X}} = \text{prod}\left(\frac{\partial \mathsf{Z}}{\partial \mathsf{Y}}, \frac{\partial \mathsf{Y}}{\partial \mathsf{X}}\right).$$
#
# - The parameters of the simple network with one hidden layer are $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$, so the objective of back propagation is to calculate the gradients $\partial J/\partial \mathbf{W}^{(1)}$ and $\partial J/\partial \mathbf{W}^{(2)}$.
# - The first step is to calculate the gradients of the objective function $J=L+s$ with respect to the loss term $L$ and the regularization term $s$:
#
# $$\frac{\partial J}{\partial L} = 1 \text{ and } \frac{\partial J}{\partial s} = 1.$$
#
# - Next we compute the gradient of the objective function with respect to the output layer variable $\mathbf{o}$ according to the chain rule:
#
# $$ \frac{\partial J}{\partial \mathbf{o}} = \text{prod}\left(\frac{\partial J}{\partial L}, \frac{\partial L}{\partial \mathbf{o}}\right) = \frac{\partial L}{\partial \mathbf{o}} \in \mathbb{R}^q. $$
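# - The cell below is a small numerical check of this step (an illustrative sketch, not part of the original text).
#   - It assumes a concrete squared loss $l(\mathbf{o}, \mathbf{y}) = \frac{1}{2}\|\mathbf{o}-\mathbf{y}\|^2$, for which $\partial L/\partial \mathbf{o} = \mathbf{o} - \mathbf{y}$; since $s$ does not depend on $\mathbf{o}$, `autograd` should report $\partial J/\partial \mathbf{o} = \mathbf{o} - \mathbf{y}$.

# In[ ]:


from mxnet import nd, autograd

o = nd.random.normal(shape=(3,))   # stand-in output layer variable (q = 3 assumed)
y = nd.random.normal(shape=(3,))   # stand-in target vector
o.attach_grad()
with autograd.record():
    L = 0.5 * ((o - y) ** 2).sum()   # assumed squared loss
    J = L                            # s is constant w.r.t. o, so it is omitted here
J.backward()
print(o.grad)   # dJ/do from back propagation
print(o - y)    # analytical dL/do for the assumed loss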
#
# - Next we calculate the gradients of the regularization term with respect to both parameters, as well as the gradient of the output with respect to $\mathbf{W}^{(2)}$:
#
# $$\frac{\partial s}{\partial \mathbf{W}^{(1)}} = \lambda \mathbf{W}^{(1)}, \quad \frac{\partial s}{\partial \mathbf{W}^{(2)}} = \lambda \mathbf{W}^{(2)} \quad \text{and} \quad \frac{\partial \mathbf{o}}{\partial \mathbf{W}^{(2)}} = \mathbf{h}^\top. $$
#
# - Now we are able to calculate the gradient $\partial J/\partial \mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$ of the model parameters closest to the output layer. Using the chain rule yields:
#
# $$ \frac{\partial J}{\partial \mathbf{W}^{(2)}} = \text{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{W}^{(2)}}\right) + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(2)}}\right) = \frac{\partial L}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)}. $$
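# - As a sanity check, the next cell compares `autograd`'s $\partial J/\partial \mathbf{W}^{(2)}$ against this closed form (an illustrative sketch; the squared loss, the ReLU hidden variable and $\lambda = 0.1$ are assumptions, not from the text).

# In[ ]:


from mxnet import nd, autograd

lambd = 0.1                                  # assumed regularization strength
h = nd.relu(nd.random.normal(shape=(4,)))    # stand-in hidden variable (h = 4 assumed)
y = nd.random.normal(shape=(3,))             # stand-in target (q = 3 assumed)
W2 = nd.random.normal(shape=(3, 4))
W2.attach_grad()
with autograd.record():
    o = nd.dot(W2, h)
    L = 0.5 * ((o - y) ** 2).sum()           # assumed squared loss
    s = lambd / 2 * (W2 ** 2).sum()          # the W^(1) term is constant w.r.t. W^(2)
    J = L + s
J.backward()
print(W2.grad)                                                          # autograd result
print(nd.dot((o - y).reshape((3, 1)), h.reshape((1, 4))) + lambd * W2)  # dL/do h^T + lambda W2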
#
# - Proceeding in the same way, we continue backward through $\partial J/\partial \mathbf{h}$ and $\partial J/\partial \mathbf{z}$ to obtain $\partial J/\partial \mathbf{W}^{(1)}$, and so forth.

# ## 3.14.4 Training a Model
# - When training networks, forward and backward propagation depend on each other.
#   - For forward propagation we traverse the compute graph in the direction of dependencies and compute all the variables on its path.
#   - For backpropagation, these variables are used to compute gradients on the graph in reverse order.
# - Why does backpropagation require significantly more memory?
#   - 1) We need to retain the intermediate values until backpropagation is complete.
#   - 2) We typically train with minibatches containing more than one variable.

# ## 3.15 Numerical Stability and Initialization

# ## 3.15.1 Vanishing and Exploding Gradients
# - Consider a deep network with $d$ layers, input $\mathbf{x}$ and output $\mathbf{o}$. Each layer satisfies
#
# $$\mathbf{h}^{t+1} = f_t (\mathbf{h}^{t}) \text{ and thus } \mathbf{o} = f_d \circ \ldots \circ f_1(\mathbf{x}).$$
#
# - The gradient of $\mathbf{o}$ with respect to any set of parameters $\mathbf{W}_t$ associated with the function $f_t$ at layer $t$ can be written as
#
# $$\partial_{\mathbf{W}_t} \mathbf{o} = \underbrace{\partial_{\mathbf{h}^{d-1}} \mathbf{h}^d}_{:= \mathbf{M}_d} \cdot \ldots \cdot \underbrace{\partial_{\mathbf{h}^{t+1}} \mathbf{h}^{t+2}}_{:= \mathbf{M}_{t+1}} \cdot \underbrace{\partial_{\mathbf{h}^{t}} \mathbf{h}^{t+1}}_{:= \mathbf{M}_t} \cdot \underbrace{\partial_{\mathbf{W}_t} \mathbf{h}^t}_{:= \mathbf{v}_t}.$$
#
# - The matrices $\mathbf{M}_t$ may well have a wide variety of eigenvalues.
#   - They might be small, they might be large, and in particular, their product might well be very large or very small.
#   - http://setosa.io/ev/eigenvectors-and-eigenvalues/
# - This means that the optimization algorithm is bound to fail.
#   - It receives gradients that lead to either excessively large or excessively small steps.
#   - In the former case, the parameters explode.
#   - In the latter case, we end up with vanishing gradients and no meaningful progress.
# - Exploding Gradients
#   - We draw 100 Gaussian random matrices and multiply them with some initial matrix.

# In[15]:


get_ipython().run_line_magic('matplotlib', 'inline')
import mxnet as mx
from mxnet import nd, autograd
from matplotlib import pyplot as plt
from numpy import linalg as LA

M = nd.random.normal(shape=(4, 4))
print('A single matrix', M)
print()
w, v = LA.eig(M.asnumpy())
print(w)

for i in range(100):
    M = nd.dot(M, nd.random.normal(shape=(4, 4)))

print('After multiplying 100 matrices', M)


# - Vanishing Gradients
#   - Picking the sigmoid as the nonlinear activation function might be problematic.
#   - If the activations are not in the range $[-4, 4]$, the gradients of the overall product may vanish.
#   - Before the ReLU, $\max(0,x)$, was proposed, this problem used to be the bane of deep network training.
#   - As a consequence, ReLU has become the default choice when designing activation functions in deep networks.

# In[13]:


x = nd.arange(-8.0, 8.0, 0.1)
x.attach_grad()
with autograd.record():
    y = x.sigmoid()
y.backward()

plt.figure(figsize=(8, 4))
plt.plot(x.asnumpy(), y.asnumpy())
plt.plot(x.asnumpy(), x.grad.asnumpy())
plt.legend(['sigmoid', 'gradient'])
plt.show()


# - Symmetry
#   - Assume that we initialize the parameters of a layer as $\mathbf{W}_l = 0$ or such that all entries of $\mathbf{W}_l$ are identical.
#   - In this case the gradients for all units in the layer are identical.
#   - We will never be able to use the expressive power inherent in the layer.
#   - In fact, the hidden layer behaves as if it had only a single unit (see the small sketch below).
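# - The next cell is a minimal sketch of this symmetry problem (the tiny 2-unit hidden layer and the constant 0.5 are illustrative assumptions, not from the text): with identical initial weights, both hidden units compute the same value and receive identical gradients, so they stay identical after every gradient update.

# In[ ]:


from mxnet import nd, autograd

x = nd.array([1.0, -2.0])
W1 = 0.5 * nd.ones(shape=(2, 2))   # all hidden-layer weights identical
W2 = 0.5 * nd.ones(shape=(1, 2))
W1.attach_grad()
with autograd.record():
    h = nd.dot(W1, x).sigmoid()    # both hidden units compute the same activation
    o = nd.dot(W2, h)
o.backward()
print(W1.grad)   # both rows are identical, so the units can never become different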
# ## 3.15.2 Parameter Initialization
# - With a suitable parameter initialization we can ensure that, at least initially, the gradients do not vanish and the network weights stay within a reasonable scale where they do not diverge.
# - Basic initialization
#   - `net.initialize()`
#     - It uses the default random initialization method.
#     - Each element of the weight parameters is randomly sampled from a uniform distribution $U[-0.07, 0.07]$.
#     - Bias parameters are all set to $0$.
#   - `net.initialize(init.Normal(sigma=0.01))`
#     - It picks normally distributed random numbers as initial values for the weights and biases.
#   - Both choices tend to work quite well in practice for moderate problem sizes.
#
# - Xavier Initialization
#   - http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
#   - Assume that the hidden units of a layer are computed as
#
# $$h_{i} = \sum_{j=1}^{n_\mathrm{in}} W_{ij} x_j.$$
#
#   - The weights $W_{ij}$ are all drawn independently from the same distribution.
#     - Let's assume that this distribution has zero mean and variance $\sigma^2$.
#     - Let's assume that the inputs also have zero mean and variance $\gamma^2$ and that they are independent of $\mathbf{W}$.
#   - In this case we can compute the mean and variance of $h_i$ as follows:
#
# $$
# \begin{aligned}
# \mathbf{E}[h_i] & = \sum_{j=1}^{n_\mathrm{in}} \mathbf{E}[W_{ij} x_j] = 0 \\
# \mathbf{E}[h_i^2] & = \sum_{j=1}^{n_\mathrm{in}} \mathbf{E}[W^2_{ij} x^2_j] \\
# & = \sum_{j=1}^{n_\mathrm{in}} \mathbf{E}[W^2_{ij}] \mathbf{E}[x^2_j] \\
# & = n_\mathrm{in} \sigma^2 \gamma^2
# \end{aligned}
# $$
#
#   - One way to keep the variance of $h_i$ fixed is to set $n_\mathrm{in} \sigma^2 = 1$.
#   - When backpropagation is executed, instead of $\mathbf{W} \mathbf{x}$ we need to deal with $\mathbf{W}^\top \mathbf{g}$, where $\mathbf{g}$ is the incoming gradient from the layer above.
#     - Using the same reasoning, one way to keep the gradients' variance fixed is to set $n_\mathrm{out} \sigma^2 = 1$.
#   - Since we usually cannot satisfy both conditions at the same time, we simply try to satisfy
#
# $$
# \begin{aligned}
# \frac{1}{2} (n_\mathrm{in} + n_\mathrm{out}) \sigma^2 = 1 \text{ or equivalently } \sigma = \sqrt{\frac{2}{n_\mathrm{in} + n_\mathrm{out}}}.
# \end{aligned}
# $$
#
#   - For Gaussian random variables...
#     - the Xavier initialization picks a normal distribution with zero mean and variance $\sigma^2 = 2/(n_\mathrm{in} + n_\mathrm{out})$.
#   - For uniformly distributed random variables $U[-a, a]$...
#     - What value of $a$ should we choose?
#     - The variance of $U[-a, a]$ is $a^2/3$.
#     - Setting $a^2/3 = 2/(n_\mathrm{in} + n_\mathrm{out})$ yields $a=\sqrt{6/(n_\mathrm{in} + n_\mathrm{out})}$.
#
# - Beyond
#   - The `mxnet.initializer` package
#     - https://mxnet.incubator.apache.org/api/python/optimization/optimization.html#the-mxnet-initializer-package

# # 3.16 Environment
# - Two important things
#   - 1) where the data came from
#   - 2) how the models we build get deployed

# ## 3.16.1 Covariate Shift
# - ***Covariate Shift***
#   - the situation where the distribution over the covariates (i.e., the inputs) on the test data is shifted relative to the training case
#   - Mathematically speaking, we are referring to the case where $p(x)$ changes but $p(y|x)$ remains unchanged (the cell below fills in a small simulated example).

# In[ ]:
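# A minimal simulated sketch of covariate shift (all numbers here are
# illustrative assumptions, not taken from the text): the labelling rule
# p(y|x) is the same function for both sets, but the input distribution
# p(x) is shifted between training and test data.
import numpy as np

def labelling_rule(x):                  # p(y|x): identical for train and test
    return np.sin(x) + 0.1 * np.random.randn(*x.shape)

x_train = np.random.normal(0.0, 1.0, size=1000)   # training covariates
x_test = np.random.normal(3.0, 1.0, size=1000)    # shifted test covariates: p(x) changed
y_train, y_test = labelling_rule(x_train), labelling_rule(x_test)
print('mean of training covariates:', x_train.mean())
print('mean of test covariates:    ', x_test.mean())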