# %load /Users/facai/Study/book_notes/preconfig.py
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
#sns.set(font='SimHei')
plt.rcParams['axes.grid'] = False
#from IPython.display import SVG
def show_image(filename, figsize=None, res_dir=True):
    if figsize:
        plt.figure(figsize=figsize)
    if res_dir:
        filename = './res/{}'.format(filename)
    plt.imshow(plt.imread(filename))
optimization: finding the parameters $\theta$ of a neural network that significantly reduce a cost function $J(\theta)$.
expectation is taken across the data generating distribution $p_{data}$ rather than just over the finite training set:
\begin{equation} J^*(\theta) = \mathcal{E}_{(x, y) \sim p_{data}} L(f(x; \theta), y) \end{equation}

Rather than optimizing the risk directly, we optimize the empirical risk, and hope that the risk decreases significantly as well.
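As a toy illustration, the empirical risk is just the mean loss over the finite training set, standing in for the expectation under $p_{data}$ (the data and the linear model here are hypothetical):

```python
import numpy as np

# Hypothetical training set for a one-parameter linear model f(x; theta) = theta * x.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

def empirical_risk(theta):
    """Mean squared-error loss over the finite training set."""
    preds = X @ theta
    return np.mean((preds - y) ** 2)

risk = empirical_risk(np.array([2.0]))   # theta = 2 fits the toy data exactly
```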
Small batches can offer a regularizing effect.
gradient-based methods can handle smaller batch sizes like 100, while second-order methods typically require much larger batch sizes like 10,000.
minibatches must be selected randomly. For very large datasets, it is usually sufficient to shuffle the order of the dataset once and then store it in shuffled fashion.
On the second pass, the estimator becomes biased because it is formed by re-sampling values used before.
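A minimal sketch of the shuffle-once scheme: permute the example order, then slice consecutive minibatches (function and variable names are my own):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield minibatches after shuffling the example order once."""
    idx = rng.permutation(len(X))          # one random permutation of the dataset
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)
y = np.arange(10)
batches = list(minibatches(X, y, batch_size=4, rng=rng))
```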
To determine whether ill-conditioning is a problem, one can monitor the squared gradient norm $g^T g$ and the $g^T H g$ term. In many cases, the gradient norm does not shrink significantly throughout learning, but the $g^T H g$ term grows by more than an order of magnitude.
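Both quantities are cheap to compute once the gradient is available. On a quadratic cost $J(\theta) = \frac12 \theta^T A \theta$ the Hessian is exactly $A$, so a toy check looks like this (the matrix and point are illustrative only):

```python
import numpy as np

# Quadratic cost J(theta) = 0.5 * theta^T A theta, so grad = A @ theta and H = A.
A = np.array([[10.0, 0.0],
              [0.0, 0.1]])        # ill-conditioned Hessian (condition number 100)
theta = np.array([1.0, 1.0])

g = A @ theta                     # gradient at theta
sq_norm = g @ g                   # monitored quantity g^T g
curvature = g @ (A @ g)           # monitored quantity g^T H g
```

When $\epsilon \, g^T H g$ exceeds $2 \, g^T g$, a gradient step of size $\epsilon$ increases the cost despite a large gradient, which is the ill-conditioning signature.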
model identifiability problem: models with latent variables are often not identifiable, due to weight space symmetry.
saddle point: local minimum along one cross-section, and local maximum along another cross-section.
maxima
wide, flat regions of constant value
show_image("fig8_3.png")
can be avoided using the gradient clipping heuristic
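One common form of the heuristic is clip-by-norm: rescale the gradient whenever its norm exceeds a threshold, so a cliff cannot catapult the parameters (a sketch; names are my own):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g so its norm is at most `threshold` (clip-by-norm heuristic)."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)   # keep the direction, cap the step size
    return g

g = np.array([30.0, 40.0])           # norm 50, e.g. near a cliff in the cost
clipped = clip_gradient(g, threshold=5.0)
```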
when graph becomes extremely deep => vanishing and exploding gradient problem
Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function, while exploding gradients can make learning unstable.
Many existing research directions are aimed at finding good initial points, rather than developing algorithms that use non-local moves.
In practice, it is necessary to gradually decrease the learning rate over time.
In practice, it is common to decay the learning rate linearly until iteration $\tau$:
\begin{equation} \epsilon_k = (1 - \alpha) \epsilon_0 + \alpha \epsilon_{\tau} \end{equation}
with $\alpha = k / \tau$; after iteration $\tau$, $\epsilon$ is commonly left constant.

To study the convergence rate of an optimization algorithm, it is common to measure the excess error $J(\theta) - \min_\theta J(\theta)$.
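The linear decay schedule as a small helper (assuming, as is common, that the rate is held at $\epsilon_\tau$ after iteration $\tau$):

```python
def learning_rate(k, eps0, eps_tau, tau):
    """Linearly decay from eps0 to eps_tau over tau iterations, then hold constant."""
    if k >= tau:
        return eps_tau
    alpha = k / tau
    return (1 - alpha) * eps0 + alpha * eps_tau
```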
The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.
\begin{align} v &\gets \alpha v - \epsilon \nabla_\theta \left ( \frac1{m} \displaystyle \sum^m_{i = 1} L \left ( f(x^{(i)}; \theta), y^{(i)} \right ) \right ) \\ \theta &\gets \theta + v \end{align}

The larger $\alpha$ is relative to $\epsilon$, the more previous gradients affect the current direction.
show_image("fig8_5.png")
the size of each step is $\frac{\epsilon \| g \|}{1 - \alpha} \implies$ it is thus helpful to think of the momentum hyperparameter in terms of $\frac1{1 - \alpha}$.
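The momentum update as a sketch: with a constant gradient, the velocity settles at a terminal step of size $\epsilon \| g \| / (1 - \alpha)$ (the defaults below are illustrative):

```python
import numpy as np

def momentum_step(theta, v, grad, eps=0.01, alpha=0.9):
    """One SGD-with-momentum update: v <- alpha*v - eps*g; theta <- theta + v."""
    v = alpha * v - eps * grad
    return theta + v, v

# With a constant gradient, |v| approaches eps * ||g|| / (1 - alpha) = 0.01/0.1 = 0.1.
theta, v = np.zeros(2), np.zeros(2)
g = np.array([1.0, 0.0])
for _ in range(500):
    theta, v = momentum_step(theta, v, g)
terminal_step = np.linalg.norm(v)
```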
Nesterov momentum: the gradient is evaluated after the current velocity is applied.
\begin{align} v &\gets \alpha v - \epsilon \nabla_\theta \left ( \frac1{m} \displaystyle \sum^m_{i = 1} L \left ( f(x^{(i)}; \theta + \color{blue}{\alpha v}), y^{(i)} \right ) \right ) \\ \theta &\gets \theta + v \end{align}

i.e., the gradient is computed with a lookahead correction already applied.
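The same sketch with the Nesterov lookahead; `grad_fn` is a hypothetical gradient oracle, demonstrated on the quadratic $J(\theta) = \frac12 \|\theta\|^2$ whose gradient is $\theta$ itself:

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, eps=0.01, alpha=0.9):
    """Nesterov momentum: evaluate the gradient at the lookahead point theta + alpha*v."""
    g = grad_fn(theta + alpha * v)   # gradient after the interim velocity is applied
    v = alpha * v - eps * g
    return theta + v, v

# Minimize J(theta) = 0.5 * ||theta||^2; grad_fn(t) = t.
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_fn=lambda t: t)
```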
Designing improved initialization strategies is a difficult task because neural network optimization is not yet well understood.
A further difficulty is that some initial points may be beneficial from the viewpoint of optimization but detrimental from the viewpoint of generalization.
the only property known with complete certainty: the initial parameters must break symmetry between different units.
We can think of initializing the parameters $\theta$ to $\theta_0$ as being similar to imposing a Gaussian prior $p(\theta)$ with mean $\theta_0$.
$\implies$ choose $\theta_0$ to be near 0 = more likely that units do not interact with each other than that they do interact.
A good rule of thumb for choosing the initial scales is to look at the range or standard deviation of activations or gradients on a single minibatch of data.
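A related concrete scale heuristic from the same literature is normalized (Glorot) initialization, which sets the uniform range from the layer's fan-in and fan-out (a sketch, not the only reasonable choice):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Normalized (Glorot) init: U(-r, r) with r = sqrt(6 / (fan_in + fan_out))."""
    r = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = glorot_uniform(256, 128, rng)   # hypothetical layer sizes
```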
e.g., initializing a supervised model with the parameters learned by an unsupervised model trained on the same inputs.
the cost is often highly sensitive to some directions in parameter space and insensitive to others. => adapt the learning rates of model parameters.
gradient accumulation (AdaGrad)
$\text{rate} \propto \frac1{\sqrt{\sum \text{squared gradients}}}$
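A sketch of one AdaGrad update, with the usual small constant $\delta$ for numerical stability (default values here are illustrative):

```python
import numpy as np

def adagrad_step(theta, r, grad, eps=0.01, delta=1e-7):
    """AdaGrad: per-parameter rates shrink with the accumulated squared gradient."""
    r = r + grad ** 2                                  # accumulate squared gradients
    theta = theta - eps * grad / (delta + np.sqrt(r))  # scale rate by 1/sqrt(r)
    return theta, r

theta, r = np.array([1.0]), np.zeros(1)
theta, r = adagrad_step(theta, r, grad=np.array([2.0]))
```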
RMSProp: changing the gradient accumulation into an exponentially weighted moving average.
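The only change relative to AdaGrad is the decayed accumulation, which lets the effective rate recover after early large gradients (a sketch; defaults are illustrative):

```python
import numpy as np

def rmsprop_step(theta, r, grad, eps=0.001, rho=0.9, delta=1e-6):
    """RMSProp: exponentially weighted moving average of squared gradients."""
    r = rho * r + (1 - rho) * grad ** 2      # decayed accumulation, not a running sum
    theta = theta - eps * grad / np.sqrt(delta + r)
    return theta, r

theta, r = np.array([1.0]), np.zeros(1)
theta, r = rmsprop_step(theta, r, grad=np.array([2.0]))
```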
adaptive moments (Adam): a combination of RMSProp and momentum.
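A sketch of one Adam update: a momentum-like first moment, an RMSProp-like second moment, and a correction for their zero initialization (default hyperparameters are the commonly cited ones):

```python
import numpy as np

def adam_step(theta, s, r, t, grad, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: first/second moment estimates with initialization-bias correction."""
    t += 1
    s = rho1 * s + (1 - rho1) * grad           # first moment (momentum-like)
    r = rho2 * r + (1 - rho2) * grad ** 2      # second moment (RMSProp-like)
    s_hat = s / (1 - rho1 ** t)                # bias-corrected estimates
    r_hat = r / (1 - rho2 ** t)
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r, t

theta, s, r, t = np.array([1.0]), np.zeros(1), np.zeros(1), 0
theta, s, r, t = adam_step(theta, s, r, t, grad=np.array([5.0]))
```

On the first step the bias correction makes the update approximately $\epsilon \cdot \text{sign}(g)$, regardless of the gradient's magnitude.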
Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta, and Adam.
The choice of which algorithm to use, at this point, seems to depend largely on the user’s familiarity with the algorithm (for ease of hyperparameter tuning).
batch normalization: adaptive reparametrization => training very deep models
coordinate descent: bad when the value of one variable strongly influences the optimal value of another.
Polyak averaging: averaging several points in the trajectory through parameter space.
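The simple-average form can be maintained incrementally as a running mean of the visited parameter points (a sketch; in nonconvex settings an exponentially decaying average is typically used instead):

```python
import numpy as np

def polyak_average(theta_avg, theta, t):
    """Running mean of theta_1..theta_t: avg + (theta - avg) / t."""
    return theta_avg + (theta - theta_avg) / t

# Average a toy trajectory of three parameter points.
avg = np.zeros(1)
for t, point in enumerate([np.array([1.0]), np.array([3.0]), np.array([5.0])], start=1):
    avg = polyak_average(avg, point, t)
```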
supervised pretraining: train simpler models on simpler tasks => then make the model more complex.
In practice, it is more important to choose a model family that is easy to optimize than to use a powerful optimization algorithm.