This notebook is A) an attempt to reproduce the plots in Matt Zeiler's AdaDelta tech report and B) investigate AdaDelta and various other first-order optimization methods beyond what Matt did in his report.

I will be using the DeepFried Theano library/toolbox for this.

In [132]:

```
import numpy as np
import matplotlib.pyplot as plt
import sys
import time
from IPython.display import HTML, display
from IPython.html import widgets
```

In [2]:

```
%matplotlib inline
# Font which got unicode math stuff.
import matplotlib as mpl
mpl.rcParams['font.family'] = 'DejaVu Sans'
# Much more readable plots
plt.style.use('ggplot')
```

In [3]:

```
import DeepFried as df
```

In [4]:

```
def printnow(where, fmt, *a, **kw):
if where is not None:
where.write(fmt.format(*a, **kw))
where.flush()
```

In [173]:

```
def make_table(data, header, margin, datafmt='{:.2%}', aggfn=None):
if aggfn is None:
aggfn = lambda data: (datafmt + ' ± ' + datafmt).format(np.mean(data), np.std(data))
s = '<table><tr><th>'
for h in header:
s += '<th style="text-align: center">' + h
for m, row in zip(margin, data):
s += '<tr><th>' + m
for d in row:
s += '<td style="text-align: center">' + datafmt.format(d)
if data.ndim > 1 and aggfn is not False:
s += '<th>' + aggfn(row)
s += '</tr>'
if data.ndim > 1 and aggfn is not False:
s += '<tr><th>'
for col in data.T:
s += '<th style="text-align: center">' + aggfn(col)
s += '</table>'
return HTML(s)
```

In [237]:

```
def plot_banded_error(errs, label, ax=None):
ax = ax or plt
m = np.mean(errs, axis=0)*100
s = np.std(errs, axis=0)*100
line, = ax.plot(np.arange(1, len(m)+1), m, label=label)
ax.fill_between(np.arange(1, len(m)+1), m-s, m+s, color=line.get_color(), alpha=0.075)
```

In [238]:

```
def fatlegend(ax, *a, **kw):
leg = ax.legend(*a, **kw)
for l in leg.legendHandles:
l.set_linewidth(l.get_linewidth()*2.0)
return leg
```

In [5]:

```
import pickle
import gzip
(Xtr, ytr), (Xva, yva), (Xte, yte) = pickle.load(gzip.open('mnist.pkl.gz'), encoding='latin1')
ytr, yva, yte = ytr.astype(np.int32), yva.astype(np.int32), yte.astype(np.int32)
Xtr.shape, Xva.shape, Xte.shape
```

Out[5]:

In [6]:

```
X, y = np.concatenate((Xtr, Xva)), np.concatenate((ytr, yva))
X.shape, y.shape
```

Out[6]:

The paper's description of the (first) model:

For comparison with Schaul et al.'s method we trained with tanh nonlinearities and 500 hidden units in the first layer followed by 300 hidden units in the second layer, with the final softmax output layer on top. Our method was trained on mini-batches of 100 images per batch for 6 epochs through the training set.

This is a very standard architecture which is easily built using the `Sequence`

container in, technically, a single line of code:

In [46]:

```
tanh_model = df.containers.Sequence(
df.layers.FullyConnected(28*28, 500),
df.layers.Tanh(init=0.09),
df.layers.FullyConnected(500, 300),
df.layers.Tanh(init=0.09),
df.layers.FullyConnected(300, 10),
df.layers.Softmax()
)
```

Now we need to pass that model to the AdaDelta optimizer along with a cost that should be minimized. The optimizer will "compile" AdaDelta's update-function for this specific model and cost. The paper further details the used training hyperparameters as

Setting the hyperparameters to $\epsilon$ = 1e − 6 and $\rho$ = 0.95 we achieve 2.00% test set error compared to the 2.10% of Schaul et al.

Note that this implementation of the optimizer actually uses the paper's values as default values for these hyperparameters, so we wouldn't even need to explicitly set them here.

In [47]:

```
tanh_model_adadelta = df.optim.StreaMiniAdaDelta(100, tanh_model, df.costs.CategoricalCrossEntropy(), rho=0.95, eps=1e-6)
```

`NaN`

by default, so that there'll be a lot of noise in case one forgets about initialization.

In [52]:

```
tanh_model.reinit(rng=13)
tanh_model_adadelta.reinit()
```

In [54]:

```
for epoch in range(6):
printnow(sys.stdout, "Training epoch {} ... \n", epoch+1)
tanh_model_adadelta.fit_epoch(X, y)
```

`Predictor`

which compiles the code for doing prediction. We can then use the predictions to compute the error, which should be around 2.00%.

In [55]:

```
tanh_model_pred = df.pred.StreaMiniPredictor(100, tanh_model)
```

In [58]:

```
print("Error: {:.2%}".format(df.postproc.classification_error(tanh_model_pred.pred_epoch(Xte), yte, percent=True)))
```

This looks pretty good.

In [192]:

```
def init_seed_experiment(stds, seeds, nepochs=6):
errors = np.empty((len(stds), len(seeds)))
progress = widgets.IntProgress(value=0, min=0, max=len(stds)*len(seeds)-1, description='Training:')
display(progress)
for i, std in enumerate(stds):
model = df.containers.Sequence(
df.layers.FullyConnected(28*28, 500),
df.layers.Tanh(init=std),
df.layers.FullyConnected(500, 300),
df.layers.Tanh(init=std),
df.layers.FullyConnected(300, 10),
df.layers.Softmax()
)
adadelta = df.optim.StreaMiniAdaDelta(100, model, df.costs.CategoricalCrossEntropy(), rho=0.95, eps=1e-6)
pred = df.pred.StreaMiniPredictor(100, model)
for j, r in enumerate(seeds):
model.reinit(rng=r)
adadelta.reinit()
for epoch in range(nepochs):
adadelta.fit_epoch(X, y)
errors[i,j] = df.postproc.classification_error(pred.pred_epoch(Xte), yte, percent=True)
progress.value += 1
return errors
```

In [193]:

```
stds = (0.05, 0.08, 0.1)
seeds = (7, 13, 42)
errors = init_seed_experiment(stds, seeds)
```

In [179]:

```
make_table(errors, ['r = ' + str(r) for r in seeds], ['σ = ' + str(s) for s in stds])
```

Out[179]:

In [97]:

```
errors50 = init_seed_experiment(stds, seeds, nepochs=50)
```

In [178]:

```
make_table(errors50, ['r = ' + str(r) for r in seeds], ['σ = ' + str(s) for s in stds])
```

Out[178]:

In [186]:

```
manyseeds = np.random.randint(2**31, size=20)
errors_manyseeds = init_seed_experiment(stds + ('Xavier',), manyseeds, nepochs=6)
```

In [188]:

```
mean = np.mean(errors_manyseeds, axis=1)
std = np.std(errors_manyseeds, axis=1)
ptp = np.ptp(errors_manyseeds, axis=1)
make_table(np.vstack((mean, std, ptp)).T,
['mean', 'std. dev.', 'max - min'], ['σ = ' + str(s) for s in stds] + ['Xavier'], aggfn=False)
```

Out[188]:

Yep, we can clearly see that $0.1\%$ is well within the variance due to the chosen random seed, thus we shall not make any conclusions based on that figure. Matt doesn't really do so anyways in his report, stating that

While this is nowhere near convergence it gives a sense of how quickly the algorithms can optimize the classification objective.

I simply want to stress that it is important A) to report the used initialization, which is unfortunately left unspecified throughout the report, as well as B) to run multiple experiments with a varying seed in order to determine the significance of a result.

Regarding A), I will simply use `Xavier`

initialization for the remainder of this notebook as it is nowaday's go-to choice for initialization and it performs in the same ballpark as other initializations as seen in the previous experiment.

Regarding B), I will run each experiment five times with a random seed each time and always report mean and standard deviation of all experiments. This way, we can see whether a difference between two experiments is significant or not.

Let's now move on to the meat.

`Xavier`

initialization is the default initialization used by `df.layers.ReLU`

and we thus don't need to specify anything.

In [199]:

```
relu_model = df.containers.Sequence(
df.layers.FullyConnected(28*28, 500),
df.layers.ReLU(),
df.layers.FullyConnected(500, 300),
df.layers.ReLU(),
df.layers.FullyConnected(300, 10),
df.layers.Softmax()
)
```

In [200]:

```
def table2_experiment(model, rho, eps, nrounds=5, nepochs=6):
pred = df.pred.StreaMiniPredictor(100, model)
errors = np.empty((len(eps), len(rho), nrounds))
progress = widgets.IntProgress(value=0, min=0, max=np.prod(errors.shape)*nepochs-1, description='Experiments:')
display(progress)
for i, e in enumerate(eps):
for j, r in enumerate(rho):
optim = df.optim.StreaMiniAdaDelta(100, model, df.costs.CategoricalCrossEntropy(), rho=r, eps=e)
for k in range(nrounds):
model.reinit(None) ; optim.reinit()
for epoch in range(nepochs):
optim.fit_epoch(X, y)
progress.value += 1
errors[i,j,k] = df.postproc.classification_error(pred.pred_epoch(Xte), yte, percent=True)
return errors
rhos = (0.9, 0.95, 0.99)
epss = (1e-2, 1e-4, 1e-6, 1e-8)
table2_errors = table2_experiment(relu_model, rhos, epss)
```

In [212]:

```
make_table(np.concatenate((np.mean(table2_errors, axis=2)[:,:,np.newaxis],
np.std(table2_errors, axis=2)[:,:,np.newaxis]), axis=2),
['ρ = ' + str(r) for r in rhos], ['ϵ = {:.0e}'.format(e) for e in epss],
datafmt='{0[0]:.2%} ± {0[1]:.2%}', aggfn=False)
```

Out[212]:

Qualitatively, this table is the same as in the report: the lowest error is reached using $\rho=0.95$ and $\epsilon = 1e^{-6}$, and the errors slightly increase but don't get too large when modifying the hyperparameters. Quantitatively though, the whole table's average error is rougly $0.3\%$ higher than in the AdaDelta report. I'm not sure if this difference stems from the initialization I've used, a bug in my code, or if I'm missing something else. If anybody finds out, please let me know.

Just for reference, here's the same table when taking the lowest error across the five runs:

In [213]:

```
make_table(np.min(table2_errors, axis=2),
['ρ = ' + str(r) for r in rhos], ['ϵ = {:.0e}'.format(e) for e in epss], aggfn=False)
```

Out[213]:

*much more* sensitive to their hyperparameters than AdaDelta. I will make this experiment a little more exhaustive by adding all other optimization methods which are implemented in DeepFried, as well as run with multiple values of other methods' additional hyperparameters. This experiment will take quite some time, as it's runing $5 \cdot 11 \cdot 5 = 275$ trainings of six epochs, totalling $1650$ epochs; rejoice of the Jupyter progressbar widget!

In [229]:

```
from collections import OrderedDict
def table1_experiment(model, lrs, optims, nrounds=5, nepochs=6):
pred = df.pred.StreaMiniPredictor(100, model)
errors = np.empty((len(lrs), len(optims), nrounds))
progress = widgets.IntProgress(value=0, min=0, max=np.prod(errors.shape)*nepochs-1, description='Experiments:')
display(progress)
for i, lr in enumerate(lrs):
for j, optim in enumerate(optims):
for k in range(nrounds):
model.reinit(None) ; optim.reinit()
for epoch in range(nepochs):
optim.fit_epoch(X, y, lrate=lr)
progress.value += 1
errors[i,j,k] = df.postproc.classification_error(pred.pred_epoch(Xte), yte, percent=True)
return errors
cost = df.costs.CategoricalCrossEntropy()
lrs = (1e0, 1e-1, 1e-2, 1e-3, 1e-4)
optims = OrderedDict((('SGD', df.optim.StreaMiniSGD(100, relu_model, cost)),
('Momentum .9', df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.9 , nesterov=False)),
('Momentum .95', df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.95, nesterov=False)),
('Momentum .99', df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.99, nesterov=False)),
('Nesterov .9', df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.9 , nesterov=True)),
('Nesterov .95', df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.95, nesterov=True)),
('Nesterov .99', df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.99, nesterov=True)),
('AdaGrad', df.optim.StreaMiniAdaGrad(100, relu_model, cost)),
('RMSProp .9', df.optim.StreaMiniRMSProp(100, relu_model, cost, 0.9)),
('RMSProp .95', df.optim.StreaMiniRMSProp(100, relu_model, cost, 0.95)),
('RMSProp .99', df.optim.StreaMiniRMSProp(100, relu_model, cost, 0.99))))
table1_errors = table1_experiment(relu_model, lrs, optims.values(), nrounds=5, nepochs=6)
```

In [230]:

```
make_table(np.concatenate((np.mean(table1_errors, axis=2)[:,:,np.newaxis],
np.std(table1_errors, axis=2)[:,:,np.newaxis]), axis=2),
optims.keys(), ['λ₀={:.0e}'.format(l) for l in lrs],
datafmt='{0[0]:.2%} ±{0[1]:.2%}', aggfn=False)
```

Out[230]:

*that* bad: all typically used momentum "strenghts" give acceptable results for an order of magnitude values of the learning-rate.

In [234]:

```
def record_training(trainer, nrounds=5, nepochs=50, **kw):
pred = df.pred.StreaMiniPredictor(100, trainer.model)
errs = np.empty((nrounds,nepochs))
times = np.empty(nrounds)
progress = widgets.IntProgress(value=0, min=0, max=np.prod(errs.shape)-1, description='Experiments:')
display(progress)
for i in range(nrounds):
trainer.model.reinit(None) ; trainer.reinit()
tstart = time.clock()
for epoch in range(nepochs):
trainer.fit_epoch(X, y, **kw)
errs[i,epoch] = df.postproc.classification_error(pred.pred_epoch(Xte), yte, percent=True)
progress.value += 1
times[i] = time.clock() - tstart
return errs, times
```

In [236]:

```
errs_sgd, t_sgd = record_training(df.optim.StreaMiniSGD(100, relu_model, cost), lrate=1e-1)
errs_mom1, t_mom1 = record_training(df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.9 , nesterov=False), lrate=1e-1)
errs_mom2, t_mom2 = record_training(df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.95, nesterov=False), lrate=1e-2)
errs_mom3, t_mom3 = record_training(df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.99, nesterov=False), lrate=1e-3)
errs_nes1, t_nes1 = record_training(df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.9 , nesterov=True), lrate=1e-1)
errs_nes2, t_nes2 = record_training(df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.95, nesterov=True), lrate=1e-2)
errs_nes3, t_nes3 = record_training(df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.99, nesterov=True), lrate=1e-3)
errs_adag, t_adag = record_training(df.optim.StreaMiniAdaGrad(100, relu_model, cost), lrate=1e-2)
errs_rms1, t_rms1 = record_training(df.optim.StreaMiniRMSProp(100, relu_model, cost, 0.9), lrate=1e-3)
errs_rms2, t_rms2 = record_training(df.optim.StreaMiniRMSProp(100, relu_model, cost, 0.95), lrate=1e-3)
errs_rms3, t_rms3 = record_training(df.optim.StreaMiniRMSProp(100, relu_model, cost, 0.99), lrate=1e-3)
errs_adad, t_adad = record_training(df.optim.StreaMiniAdaDelta(100, relu_model, cost, 0.99))
```

In [241]:

```
fig, ax = plt.subplots(1,1, figsize=(16,8))
plot_banded_error(errs_sgd, label='SGD', ax=ax)
plot_banded_error(errs_mom1, label='Momentum 0.9', ax=ax)
plot_banded_error(errs_nes1, label='Nesterov 0.9', ax=ax)
plot_banded_error(errs_adag, label='AdaGrad', ax=ax)
plot_banded_error(errs_rms1, label='RMSProp 0.9', ax=ax)
plot_banded_error(errs_adad, label='AdaDelta', ax=ax)
ax.grid(True)
ax.set_ylim(1, 4)
ax.set_xlabel("Number of epochs")
ax.set_ylabel("Errors on testset [%]")
fatlegend(ax)
```

Out[241]:

There's more to an optimizer than the converged error rate: who'd want to wait for ages just to be a few percentage points better? If a method is much faster, that time can be spent looking for better architectures, augmentations, ... In the end, all we care about is wall-clock time $t_w$ to a good result. This wall-clock time can be divided into two parts: the number of epochs $n_e$ to a good result and the time spend per epoch $t_e$, such that $t_w = n_e \cdot t_e + \epsilon$, where $\epsilon$ is more-or-less irrelevant.

From the previous plot (representing Fig.1 from the report), we can see that for this simple network on this simple MNIST dataset, all methods have roughly converged after $n_e = 30$ epochs. I have so far never seen $t_e$ reported in a paper, so let's have a look at that one next.

In [250]:

```
def plot_times(experiments):
N = len(experiments)
W = 0.8
margin = 1-W
plt.bar(np.arange(N), list(map(np.mean, experiments.values())), W, yerr=list(map(np.std, experiments.values())), color=mpl.rcParams['axes.color_cycle'])
plt.xticks(np.arange(N)+W/2, list(experiments.keys()))
plt.xlim(-margin, N-1+W+margin)
plt.ylabel("Time for 50 epochs [s]")
plot_times(OrderedDict((
("SGD", t_sgd),
("Momentum", np.concatenate((t_mom1, t_mom2, t_mom3))),
("Nesterov", np.concatenate((t_nes1, t_nes2, t_nes3))),
("AdaGrad", t_adag),
("RMSProp", np.concatenate((t_rms1, t_rms2, t_rms3))),
("AdaDelta", t_adad)))
)
```

There's two conclusions I see here:

- Obviously, the more complex the optimizer, the slower it gets.
- Plain momentum, if tuned properly, gives the most bang for the buck.

In [255]:

```
fig, ax = plt.subplots(1,1, figsize=(16,8))
ax.plot(np.mean(errs_mom1, axis=0)*100, label='Momentum 0.9', color='#E24A33', ls='-')
ax.plot(np.mean(errs_mom2, axis=0)*100, label='Momentum 0.95', color='#348ABD', ls='-')
ax.plot(np.mean(errs_mom3, axis=0)*100, label='Momentum 0.99', color='#988ED5', ls='-')
ax.plot(np.mean(errs_nes1, axis=0)*100, label='Nesterov 0.9', color='#E24A33', ls='--')
ax.plot(np.mean(errs_nes2, axis=0)*100, label='Nesterov 0.95', color='#348ABD', ls='--')
ax.plot(np.mean(errs_nes3, axis=0)*100, label='Nesterov 0.99', color='#988ED5', ls='--')
ax.plot(np.mean(errs_rms1, axis=0)*100, label='RMSProp 0.9', color='#E24A33', ls=':')
ax.plot(np.mean(errs_rms2, axis=0)*100, label='RMSProp 0.95', color='#348ABD', ls=':')
ax.plot(np.mean(errs_rms3, axis=0)*100, label='RMSProp 0.99', color='#988ED5', ls=':')
ax.grid(True)
ax.set_ylim(1, 4)
ax.set_xlabel("Number of epochs")
ax.set_ylabel("Errors on testset [%]")
fatlegend(ax)
```

Out[255]:

*this small network on this dataset*, is that `RMSProp`

(the very lightly dotted lines) is pretty consistent across strength of momentum, while classical momentum and Nesterov momentum are consistent with eachother but vary widely across momentum strength.

For this analysis to really be complete, I'd like to add following parts when I get the time to:

- Analysis of the effective learning rates (Section 4.3 of the note). This should be relatively easy, since the accumulated quantities are readily available as
`sh_g2`

and`sh_delta2`

attributes of the optimizer object. - Analysis of robustness on other datasets. For example, I had to dramatically change the value of $\epsilon$ for AdaDelta to work at all on the NSDB Kaggle challenge dataset. This suggests less practical robustness than what the report sells.