In [1]:

```
# %load /Users/facai/Study/book_notes/preconfig.py
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
#sns.set(font='SimHei')
plt.rcParams['axes.grid'] = False
#from IPython.display import SVG

def show_image(filename, figsize=None, res_dir=True):
    if figsize:
        plt.figure(figsize=figsize)
    if res_dir:
        filename = './res/{}'.format(filename)
    plt.imshow(plt.imread(filename))
```

optimization: finding the parameters $\theta$ of a neural network that significantly reduce a cost function $J(\theta)$.

expectation is taken across **the data generating distribution** $p_{data}$ rather than just over the finite training set:

\begin{equation} J^*(\theta) = \mathbb{E}_{(x, y) \sim p_{data}} L(f(x; \theta), y) \end{equation}

Rather than optimizing the risk directly, we optimize the empirical risk, and hope that the risk decreases significantly as well.

Small batches can offer a regularizing effect.

Gradient-based methods can handle small batch sizes like 100, while second-order methods typically require much larger batch sizes like 10,000.

minibatches must be selected randomly. For very large datasets, it is usually sufficient to shuffle the order of the dataset once and then store it in shuffled fashion.
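A minimal sketch of this shuffle-once scheme; the `make_minibatches` helper and its parameters are illustrative, not from the text:

```python
import numpy as np

def make_minibatches(X, y, batch_size, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))      # shuffle the dataset once
    X, y = X[perm], y[perm]             # store it in shuffled order
    for i in range(0, len(X), batch_size):
        yield X[i:i + batch_size], y[i:i + batch_size]

X = np.arange(10).reshape(10, 1).astype(float)
y = np.arange(10)
batches = list(make_minibatches(X, y, batch_size=4))
```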

On the second pass, the estimator becomes biased because it is formed by re-sampling values used before.

To determine whether ill-conditioning is hurting training, one can monitor the squared gradient norm $g^T g$ and the $g^T H g$ term. In many cases, the gradient norm does not shrink significantly throughout learning, but the $g^T H g$ term grows by more than an order of magnitude.
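As a toy illustration of this monitoring, consider gradient descent on a quadratic cost $J(\theta) = \frac12 \theta^T H \theta$ with a hypothetical ill-conditioned Hessian; both quantities can be computed cheaply from the gradient:

```python
import numpy as np

H = np.diag([1.0, 100.0])        # hypothetical Hessian, condition number 100
theta = np.array([1.0, 1.0])
eps = 0.009                      # learning rate, chosen small enough for stability
for _ in range(50):
    g = H @ theta                # gradient of 1/2 theta^T H theta
    gg = g @ g                   # squared gradient norm g^T g
    gHg = g @ H @ g              # curvature term g^T H g
    theta -= eps * g
```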

model identifiability problem: models with latent variables are often not identifiable <= weight space symmetry.

saddle point: local minimum along one cross-section, and local maximum along another cross-section.

- in higher dimensional spaces, local minima are rare and saddle points are more common.
- difficult for Newton's method, while easy for gradient descent.

maxima and plateaus

- plateaus: wide, flat regions of constant value, where the gradient is zero

In [2]:

```
show_image("fig8_3.png")
```

cliffs: extremely steep regions; their effect can be avoided using the *gradient clipping* heuristic
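A sketch of norm-based gradient clipping, assuming the common variant that rescales the whole gradient when its norm exceeds a threshold (the threshold value here is arbitrary):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g to norm `threshold` if it is too large (illustrative sketch)."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)   # keep the direction, shrink the magnitude
    return g

g = np.array([3.0, 4.0])             # norm 5
clipped = clip_gradient(g, threshold=1.0)
```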

when the computational graph becomes extremely deep => vanishing and exploding gradient problems

Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function, while exploding gradients can make learning unstable.

Many existing research directions are aimed at finding good initial points, rather than developing algorithms that use non-local moves.

In practice, it is necessary to gradually decrease the learning rate over time.

In practice, it is common to decay the learning rate linearly until iteration $\tau$:

\begin{equation} \epsilon_k = (1 - \alpha) \epsilon_0 + \alpha \epsilon_{\tau} \end{equation}

- $\tau$: a few hundred passes through the training set
- $\epsilon_\tau \approx 1\% \, \epsilon_0$
- $\epsilon_0$: monitor the first several iterations and use a learning rate that is higher than the best-performing learning rate at this time, but not so high that it causes severe instability.
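The linear decay schedule above can be sketched as follows; the values of $\epsilon_0$, $\epsilon_\tau$ and $\tau$ are illustrative:

```python
def learning_rate(k, eps0, eps_tau, tau):
    """Linear decay until iteration tau, then constant (sketch)."""
    if k >= tau:
        return eps_tau
    alpha = k / tau
    return (1 - alpha) * eps0 + alpha * eps_tau

# eps_tau is about 1% of eps0, as suggested above
schedule = [learning_rate(k, eps0=0.1, eps_tau=0.001, tau=100)
            for k in (0, 50, 100, 200)]
```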

To study the convergence rate of an optimization algorithm, it is common to measure the *excess error* $J(\theta) - \min_\theta J(\theta)$.

- when SGD is applied to a convex problem, the excess error is $O(\frac{1}{\sqrt{k}})$,
- in the strongly convex case it is $O(\frac{1}{k})$.

The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.

\begin{align} v &\gets \alpha v - \epsilon \nabla_\theta \left ( \frac1{m} \displaystyle \sum^m_{i = 1} L \left ( f(x^{(i)}; \theta), y^{(i)} \right ) \right ) \\ \theta &\gets \theta + v \end{align}

The larger $\alpha$ is relative to $\epsilon$, the more previous gradients affect the current direction.

In [3]:

```
show_image("fig8_5.png")
```

the size of each step is $\frac{\epsilon \| g \|}{1 - \alpha} \implies$ it is thus helpful to think of the momentum hyperparameter in terms of $\frac1{1 - \alpha}$.

- Common values of $\alpha$ used in practice include 0.5, 0.9 and 0.99.
- Typically it begins with a small value and is later raised.
- It is less important to adapt $\alpha$ over time than to shrink $\epsilon$ over time.
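A minimal sketch of the momentum update above on a toy quadratic $J(\theta) = \frac12 \|\theta\|^2$ (whose gradient is just $\theta$); the hyperparameter values are illustrative:

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.9, eps=0.1):
    v = alpha * v - eps * grad   # exponentially decaying average of past gradients
    theta = theta + v            # move in the accumulated direction
    return theta, v

theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=theta)  # grad of 1/2 ||theta||^2
```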

Nesterov momentum: the gradient is evaluated after the current velocity is applied.

\begin{align} v &\gets \alpha v - \epsilon \nabla_\theta \left ( \frac1{m} \displaystyle \sum^m_{i = 1} L \left ( f(x^{(i)}; \theta + \color{blue}{\alpha v}), y^{(i)} \right ) \right ) \\ \theta &\gets \theta + v \end{align}

i.e., it takes the upcoming position into account (looks ahead before computing the gradient)
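The only change relative to standard momentum is where the gradient is evaluated; a sketch on the same toy quadratic, with illustrative hyperparameters:

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, alpha=0.9, eps=0.1):
    g = grad_fn(theta + alpha * v)   # gradient at the look-ahead point
    v = alpha * v - eps * g
    theta = theta + v
    return theta, v

theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_fn=lambda t: t)  # grad of 1/2 ||theta||^2
```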

Designing improved initialization strategies is a difficult task because neural network optimization is not yet well understood.

A further difficulty is that some initial points may be beneficial from the viewpoint of optimization but detrimental from the viewpoint of generalization.

the only property known with complete certainty: the initial parameters need to break symmetry between different units

- initialize each unit to compute a different function from all of the other units.
- random initialization of the parameters.
- Typically, set the biases for each unit to heuristically chosen constants, and initialize only the weights randomly.

We can think of initializing the parameters $\theta$ to $\theta_0$ as being similar to imposing a Gaussian prior $p(\theta)$ with mean $\theta_0$.

$\implies$ choose $\theta_0$ to be near 0 = more likely that units do not interact with each other than that they do interact.

- normalized initialization: $W_{i, j} \sim U \left ( - \sqrt{\frac{6}{m + n}}, \sqrt{\frac{6}{m + n}} \right )$
- initializing to random orthogonal matrices
- preserve norms
- sparse initialization
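A sketch of the normalized (Glorot) initialization above for a fully connected layer with $m$ inputs and $n$ outputs; the layer sizes are illustrative:

```python
import numpy as np

def glorot_uniform(m, n, seed=0):
    """W_{i,j} ~ U(-sqrt(6/(m+n)), sqrt(6/(m+n))) (sketch)."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

W = glorot_uniform(300, 100)   # variance of each weight is 2/(m+n)
```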

A good rule of thumb for choosing the initial scales is to look at the range or standard deviation of activations or gradients on a single minibatch of data.

- Setting the biases to zero is compatible with most weight initialization schemes.
- a few situations where we may set some biases to non-zero values:
- for an output unit, it is often beneficial to initialize the bias to obtain the right marginal statistics of the output.
- choose the bias to avoid causing too much saturation at initialization.

eg: set the bias of a ReLU hidden unit to 0.1 rather than 0. Sometimes a unit controls whether other units are able to participate in a function (a gate) => initialize its bias so that all units have a chance to learn.

eg: to initialize a supervised model with the parameters learned by an unsupervised model trained on the same inputs.

the cost is often highly sensitive to some directions in parameter space and insensitive to others. => adapt the learning rates of model parameters.

AdaGrad: gradient accumulation

$\text{rate} \propto \frac1{\sqrt{\sum \text{squared gradients}}}$

RMSProp: changing the gradient accumulation into an exponentially weighted moving average.

adaptive moments (Adam): combination of RMSProp and momentum
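A sketch of the Adam update combining the two ideas above: a decaying first-moment estimate (momentum) and a decaying second-moment estimate (RMSProp-style), with bias correction. The hyperparameters are the commonly used defaults, and the toy objective is $\frac12\|\theta\|^2$:

```python
import numpy as np

def adam_step(theta, m, v, g, t, eps=0.001,
              beta1=0.9, beta2=0.999, delta=1e-8):
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eps * m_hat / (np.sqrt(v_hat) + delta)
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 4001):
    theta, m, v = adam_step(theta, m, v, g=theta, t=t)  # grad of 1/2 ||theta||^2
```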

Currently, the most popular optimization algorithms actively in use include

- SGD,
- SGD with momentum,
- RMSProp,
- RMSProp with momentum,
- AdaDelta and Adam.

The choice of which algorithm to use, at this point, seems to depend largely on the user’s familiarity with the algorithm (for ease of hyperparameter tuning).

batch normalization: adaptive reparametrization => eases the training of very deep models
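A sketch of the training-time normalization step of batch normalization, applied to a minibatch of activations $H$ (one row per example); the learned scale $\gamma$ and shift $\beta$ are left at typical initial values:

```python
import numpy as np

def batch_norm(H, gamma=1.0, beta=0.0, delta=1e-8):
    mu = H.mean(axis=0)                 # per-unit minibatch mean
    sigma = H.std(axis=0)               # per-unit minibatch std
    return gamma * (H - mu) / (sigma + delta) + beta

H = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
H_norm = batch_norm(H)                  # each column now has mean 0, std 1
```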

coordinate descent: bad when the variables are strongly dependent on each other.

Polyak averaging: averaging several points in the trajectory of the optimization algorithm.
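A sketch of Polyak averaging on a toy quadratic: run plain gradient descent and report the average of the visited points rather than the last point; the step size is illustrative:

```python
import numpy as np

theta = np.array([2.0, -2.0])
thetas = []
for _ in range(100):
    theta = theta - 0.5 * theta     # gradient descent on 1/2 ||theta||^2
    thetas.append(theta.copy())
theta_avg = np.mean(thetas, axis=0) # averaged iterate, used as the final answer
```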

supervised pretraining: first train simple models on simple tasks, then make the model more complex.

In practice, it is more important to choose a model family that is easy to optimize than to use a powerful optimization algorithm.