This notebook is A) an attempt to reproduce the plots in Matt Zeiler's AdaDelta tech report and B) investigate AdaDelta and various other first-order optimization methods beyond what Matt did in his report.

I will be using the DeepFried Theano library/toolbox for this.

In [132]:

```
import numpy as np
import matplotlib.pyplot as plt
import sys
import time
from IPython.display import HTML, display
from IPython.html import widgets
```

In [2]:

```
%matplotlib inline
# Font which got unicode math stuff.
import matplotlib as mpl
mpl.rcParams['font.family'] = 'DejaVu Sans'
# Much more readable plots
plt.style.use('ggplot')
```

In [3]:

```
import DeepFried as df
```

You can skip reading these functions. They will be used mostly for prettier presentation of the results and are —I think— self-explanatory boilerplate code.

In [4]:

```
def printnow(where, fmt, *a, **kw):
if where is not None:
where.write(fmt.format(*a, **kw))
where.flush()
```

In [173]:

```
def make_table(data, header, margin, datafmt='{:.2%}', aggfn=None):
if aggfn is None:
aggfn = lambda data: (datafmt + ' ± ' + datafmt).format(np.mean(data), np.std(data))
s = '<table><tr><th>'
for h in header:
s += '<th style="text-align: center">' + h
for m, row in zip(margin, data):
s += '<tr><th>' + m
for d in row:
s += '<td style="text-align: center">' + datafmt.format(d)
if data.ndim > 1 and aggfn is not False:
s += '<th>' + aggfn(row)
s += '</tr>'
if data.ndim > 1 and aggfn is not False:
s += '<tr><th>'
for col in data.T:
s += '<th style="text-align: center">' + aggfn(col)
s += '</table>'
return HTML(s)
```

In [237]:

```
def plot_banded_error(errs, label, ax=None):
ax = ax or plt
m = np.mean(errs, axis=0)*100
s = np.std(errs, axis=0)*100
line, = ax.plot(np.arange(1, len(m)+1), m, label=label)
ax.fill_between(np.arange(1, len(m)+1), m-s, m+s, color=line.get_color(), alpha=0.075)
```

In [238]:

```
def fatlegend(ax, *a, **kw):
leg = ax.legend(*a, **kw)
for l in leg.legendHandles:
l.set_linewidth(l.get_linewidth()*2.0)
return leg
```

In [5]:

```
import pickle
import gzip
(Xtr, ytr), (Xva, yva), (Xte, yte) = pickle.load(gzip.open('mnist.pkl.gz'), encoding='latin1')
ytr, yva, yte = ytr.astype(np.int32), yva.astype(np.int32), yte.astype(np.int32)
Xtr.shape, Xva.shape, Xte.shape
```

Out[5]:

The paper compares experiments to the No More Pesky Learning Rates paper (referenced to as "pesky" from now on), which talks about a 60k training-set, i.e. they merged training and validation. Let's do this too:

In [6]:

```
X, y = np.concatenate((Xtr, Xva)), np.concatenate((ytr, yva))
X.shape, y.shape
```

Out[6]:

The paper's description of the (first) model:

For comparison with Schaul et al.'s method we trained with tanh nonlinearities and 500 hidden units in the first layer followed by 300 hidden units in the second layer, with the final softmax output layer on top. Our method was trained on mini-batches of 100 images per batch for 6 epochs through the training set.

This is a very standard architecture which is easily built using the `Sequence`

container in, technically, a single line of code:

In [46]:

```
tanh_model = df.containers.Sequence(
df.layers.FullyConnected(28*28, 500),
df.layers.Tanh(init=0.09),
df.layers.FullyConnected(500, 300),
df.layers.Tanh(init=0.09),
df.layers.FullyConnected(300, 10),
df.layers.Softmax()
)
```

Now we need to pass that model to the AdaDelta optimizer along with a cost that should be minimized. The optimizer will "compile" AdaDelta's update-function for this specific model and cost. The paper further details the used training hyperparameters as

Setting the hyperparameters to $\epsilon$ = 1e − 6 and $\rho$ = 0.95 we achieve 2.00% test set error compared to the 2.10% of Schaul et al.

Note that this implementation of the optimizer actually uses the paper's values as default values for these hyperparameters, so we wouldn't even need to explicitly set them here.

In [47]:

```
tanh_model_adadelta = df.optim.StreaMiniAdaDelta(100, tanh_model, df.costs.CategoricalCrossEntropy(), rho=0.95, eps=1e-6)
```

Before starting learning, I need to (re-)initialize both the model and the optimizer. The former will have its weights set to random initial values using the given random seed $13$ according to the initialization parameters of the nonlinearities, while the latter will have its accumulators for momentum-like quantities zeroed. All parameters are initialized to `NaN`

by default, so that there'll be a lot of noise in case one forgets about initialization.

In [52]:

```
tanh_model.reinit(rng=13)
tanh_model_adadelta.reinit()
```

Finally, as stated in the above quote, the model is trained for only six epochs, so we can simply iterate:

In [54]:

```
for epoch in range(6):
printnow(sys.stdout, "Training epoch {} ... \n", epoch+1)
tanh_model_adadelta.fit_epoch(X, y)
```

Now let's look at the error on the test set; for doing so we need a `Predictor`

which compiles the code for doing prediction. We can then use the predictions to compute the error, which should be around 2.00%.

In [55]:

```
tanh_model_pred = df.pred.StreaMiniPredictor(100, tanh_model)
```

In [58]:

```
print("Error: {:.2%}".format(df.postproc.classification_error(tanh_model_pred.pred_epoch(Xte), yte, percent=True)))
```

This looks pretty good.

While the above looks good and seems to reproduce the note's results, I deliberately chose an initialization with a standard deviation of $\sigma = 0.09$ and a random seed of $r = 13$. Let's run a number of experiments with different seeds and with slightly different initializations.

In [192]:

```
def init_seed_experiment(stds, seeds, nepochs=6):
errors = np.empty((len(stds), len(seeds)))
progress = widgets.IntProgress(value=0, min=0, max=len(stds)*len(seeds)-1, description='Training:')
display(progress)
for i, std in enumerate(stds):
model = df.containers.Sequence(
df.layers.FullyConnected(28*28, 500),
df.layers.Tanh(init=std),
df.layers.FullyConnected(500, 300),
df.layers.Tanh(init=std),
df.layers.FullyConnected(300, 10),
df.layers.Softmax()
)
adadelta = df.optim.StreaMiniAdaDelta(100, model, df.costs.CategoricalCrossEntropy(), rho=0.95, eps=1e-6)
pred = df.pred.StreaMiniPredictor(100, model)
for j, r in enumerate(seeds):
model.reinit(rng=r)
adadelta.reinit()
for epoch in range(nepochs):
adadelta.fit_epoch(X, y)
errors[i,j] = df.postproc.classification_error(pred.pred_epoch(Xte), yte, percent=True)
progress.value += 1
return errors
```

In [193]:

```
stds = (0.05, 0.08, 0.1)
seeds = (7, 13, 42)
errors = init_seed_experiment(stds, seeds)
```

In [179]:

```
make_table(errors, ['r = ' + str(r) for r in seeds], ['σ = ' + str(s) for s in stds])
```

Out[179]:

From this preliminary experiment, it looks like the seed influences the error-rate just as much as the initialization, though that might be due to either too early stopping or too few seeds tried, i.e. luck. Let's first investigate the former by re-running the exact same experiment for a much larger number of epochs: 50. This seems like a reasonable count given the error-curves displayed in Fig.1 of the report.

In [97]:

```
errors50 = init_seed_experiment(stds, seeds, nepochs=50)
```

In [178]:

```
make_table(errors50, ['r = ' + str(r) for r in seeds], ['σ = ' + str(s) for s in stds])
```

Out[178]:

While running for a longer time imporves things a bit, There's still as large as $0.3$ percentage points difference between different seeds, which is three times as much as the $0.1\%$ difference between $2.00\%$ and $2.10\%$ pointed at in the report. Just to confirm that a difference of $0.1\%$ is not really significant for 6 epochs, let's run 6 epochs with much more seeds, so that we get a good estimate of the standard deviation:

In [186]:

```
manyseeds = np.random.randint(2**31, size=20)
errors_manyseeds = init_seed_experiment(stds + ('Xavier',), manyseeds, nepochs=6)
```

In [188]:

```
mean = np.mean(errors_manyseeds, axis=1)
std = np.std(errors_manyseeds, axis=1)
ptp = np.ptp(errors_manyseeds, axis=1)
make_table(np.vstack((mean, std, ptp)).T,
['mean', 'std. dev.', 'max - min'], ['σ = ' + str(s) for s in stds] + ['Xavier'], aggfn=False)
```

Out[188]:

Yep, we can clearly see that $0.1\%$ is well within the variance due to the chosen random seed, thus we shall not make any conclusions based on that figure. Matt doesn't really do so anyways in his report, stating that

While this is nowhere near convergence it gives a sense of how quickly the algorithms can optimize the classification objective.

I simply want to stress that it is important A) to report the used initialization, which is unfortunately left unspecified throughout the report, as well as B) to run multiple experiments with a varying seed in order to determine the significance of a result.

Regarding A), I will simply use `Xavier`

initialization for the remainder of this notebook as it is nowaday's go-to choice for initialization and it performs in the same ballpark as other initializations as seen in the previous experiment.

Regarding B), I will run each experiment five times with a random seed each time and always report mean and standard deviation of all experiments. This way, we can see whether a difference between two experiments is significant or not.

Let's now move on to the meat.

For the main part of the experiments, Rectified Linear Units (ReLUs) are used instead of tanh nonlinearities, everything else stays the same. `Xavier`

initialization is the default initialization used by `df.layers.ReLU`

and we thus don't need to specify anything.

In [199]:

```
relu_model = df.containers.Sequence(
df.layers.FullyConnected(28*28, 500),
df.layers.ReLU(),
df.layers.FullyConnected(500, 300),
df.layers.ReLU(),
df.layers.FullyConnected(300, 10),
df.layers.Softmax()
)
```

Table 2 lists the test error on MNIST after training for just 6 epochs with various values for the parameters $\rho$ and $\epsilon$. The aim of this table is to show that AdaDelta is not very sensitive to its two hyperparameters.

In [200]:

```
def table2_experiment(model, rho, eps, nrounds=5, nepochs=6):
pred = df.pred.StreaMiniPredictor(100, model)
errors = np.empty((len(eps), len(rho), nrounds))
progress = widgets.IntProgress(value=0, min=0, max=np.prod(errors.shape)*nepochs-1, description='Experiments:')
display(progress)
for i, e in enumerate(eps):
for j, r in enumerate(rho):
optim = df.optim.StreaMiniAdaDelta(100, model, df.costs.CategoricalCrossEntropy(), rho=r, eps=e)
for k in range(nrounds):
model.reinit(None) ; optim.reinit()
for epoch in range(nepochs):
optim.fit_epoch(X, y)
progress.value += 1
errors[i,j,k] = df.postproc.classification_error(pred.pred_epoch(Xte), yte, percent=True)
return errors
rhos = (0.9, 0.95, 0.99)
epss = (1e-2, 1e-4, 1e-6, 1e-8)
table2_errors = table2_experiment(relu_model, rhos, epss)
```

In [212]:

```
make_table(np.concatenate((np.mean(table2_errors, axis=2)[:,:,np.newaxis],
np.std(table2_errors, axis=2)[:,:,np.newaxis]), axis=2),
['ρ = ' + str(r) for r in rhos], ['ϵ = {:.0e}'.format(e) for e in epss],
datafmt='{0[0]:.2%} ± {0[1]:.2%}', aggfn=False)
```

Out[212]:

Qualitatively, this table is the same as in the report: the lowest error is reached using $\rho=0.95$ and $\epsilon = 1e^{-6}$, and the errors slightly increase but don't get too large when modifying the hyperparameters. Quantitatively though, the whole table's average error is rougly $0.3\%$ higher than in the AdaDelta report. I'm not sure if this difference stems from the initialization I've used, a bug in my code, or if I'm missing something else. If anybody finds out, please let me know.

Just for reference, here's the same table when taking the lowest error across the five runs:

In [213]:

```
make_table(np.min(table2_errors, axis=2),
['ρ = ' + str(r) for r in rhos], ['ϵ = {:.0e}'.format(e) for e in epss], aggfn=False)
```

Out[213]:

The aim of this table is to show that other first-order optimization methods that are frequently used are *much more* sensitive to their hyperparameters than AdaDelta. I will make this experiment a little more exhaustive by adding all other optimization methods which are implemented in DeepFried, as well as run with multiple values of other methods' additional hyperparameters. This experiment will take quite some time, as it's runing $5 \cdot 11 \cdot 5 = 275$ trainings of six epochs, totalling $1650$ epochs; rejoice of the Jupyter progressbar widget!

In [229]:

```
from collections import OrderedDict
def table1_experiment(model, lrs, optims, nrounds=5, nepochs=6):
pred = df.pred.StreaMiniPredictor(100, model)
errors = np.empty((len(lrs), len(optims), nrounds))
progress = widgets.IntProgress(value=0, min=0, max=np.prod(errors.shape)*nepochs-1, description='Experiments:')
display(progress)
for i, lr in enumerate(lrs):
for j, optim in enumerate(optims):
for k in range(nrounds):
model.reinit(None) ; optim.reinit()
for epoch in range(nepochs):
optim.fit_epoch(X, y, lrate=lr)
progress.value += 1
errors[i,j,k] = df.postproc.classification_error(pred.pred_epoch(Xte), yte, percent=True)
return errors
cost = df.costs.CategoricalCrossEntropy()
lrs = (1e0, 1e-1, 1e-2, 1e-3, 1e-4)
optims = OrderedDict((('SGD', df.optim.StreaMiniSGD(100, relu_model, cost)),
('Momentum .9', df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.9 , nesterov=False)),
('Momentum .95', df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.95, nesterov=False)),
('Momentum .99', df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.99, nesterov=False)),
('Nesterov .9', df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.9 , nesterov=True)),
('Nesterov .95', df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.95, nesterov=True)),
('Nesterov .99', df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.99, nesterov=True)),
('AdaGrad', df.optim.StreaMiniAdaGrad(100, relu_model, cost)),
('RMSProp .9', df.optim.StreaMiniRMSProp(100, relu_model, cost, 0.9)),
('RMSProp .95', df.optim.StreaMiniRMSProp(100, relu_model, cost, 0.95)),
('RMSProp .99', df.optim.StreaMiniRMSProp(100, relu_model, cost, 0.99))))
table1_errors = table1_experiment(relu_model, lrs, optims.values(), nrounds=5, nepochs=6)
```

In [230]:

```
make_table(np.concatenate((np.mean(table1_errors, axis=2)[:,:,np.newaxis],
np.std(table1_errors, axis=2)[:,:,np.newaxis]), axis=2),
optims.keys(), ['λ₀={:.0e}'.format(l) for l in lrs],
datafmt='{0[0]:.2%} ±{0[1]:.2%}', aggfn=False)
```

Out[230]:

This confirms the report's point that other optimization algorithms have a very tight sweet spot for their optimal parameter values, although it's not *that* bad: all typically used momentum "strenghts" give acceptable results for an order of magnitude values of the learning-rate.

Let's pick the best settings for each of the optimizers as found in the previous table and record their test-error throughout the epochs.

In [234]:

```
def record_training(trainer, nrounds=5, nepochs=50, **kw):
pred = df.pred.StreaMiniPredictor(100, trainer.model)
errs = np.empty((nrounds,nepochs))
times = np.empty(nrounds)
progress = widgets.IntProgress(value=0, min=0, max=np.prod(errs.shape)-1, description='Experiments:')
display(progress)
for i in range(nrounds):
trainer.model.reinit(None) ; trainer.reinit()
tstart = time.clock()
for epoch in range(nepochs):
trainer.fit_epoch(X, y, **kw)
errs[i,epoch] = df.postproc.classification_error(pred.pred_epoch(Xte), yte, percent=True)
progress.value += 1
times[i] = time.clock() - tstart
return errs, times
```

In [236]:

```
errs_sgd, t_sgd = record_training(df.optim.StreaMiniSGD(100, relu_model, cost), lrate=1e-1)
errs_mom1, t_mom1 = record_training(df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.9 , nesterov=False), lrate=1e-1)
errs_mom2, t_mom2 = record_training(df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.95, nesterov=False), lrate=1e-2)
errs_mom3, t_mom3 = record_training(df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.99, nesterov=False), lrate=1e-3)
errs_nes1, t_nes1 = record_training(df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.9 , nesterov=True), lrate=1e-1)
errs_nes2, t_nes2 = record_training(df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.95, nesterov=True), lrate=1e-2)
errs_nes3, t_nes3 = record_training(df.optim.StreaMiniMomentum(100, relu_model, cost, momentum=0.99, nesterov=True), lrate=1e-3)
errs_adag, t_adag = record_training(df.optim.StreaMiniAdaGrad(100, relu_model, cost), lrate=1e-2)
errs_rms1, t_rms1 = record_training(df.optim.StreaMiniRMSProp(100, relu_model, cost, 0.9), lrate=1e-3)
errs_rms2, t_rms2 = record_training(df.optim.StreaMiniRMSProp(100, relu_model, cost, 0.95), lrate=1e-3)
errs_rms3, t_rms3 = record_training(df.optim.StreaMiniRMSProp(100, relu_model, cost, 0.99), lrate=1e-3)
errs_adad, t_adad = record_training(df.optim.StreaMiniAdaDelta(100, relu_model, cost, 0.99))
```

In [241]:

```
fig, ax = plt.subplots(1,1, figsize=(16,8))
plot_banded_error(errs_sgd, label='SGD', ax=ax)
plot_banded_error(errs_mom1, label='Momentum 0.9', ax=ax)
plot_banded_error(errs_nes1, label='Nesterov 0.9', ax=ax)
plot_banded_error(errs_adag, label='AdaGrad', ax=ax)
plot_banded_error(errs_rms1, label='RMSProp 0.9', ax=ax)
plot_banded_error(errs_adad, label='AdaDelta', ax=ax)
ax.grid(True)
ax.set_ylim(1, 4)
ax.set_xlabel("Number of epochs")
ax.set_ylabel("Errors on testset [%]")
fatlegend(ax)
```

Out[241]:

This plot is very similar to Figure 1 in the report, with most methods plateauing at $1.5\%$ error, momentum being marginally better than AdaDelta, and plain SGD being by far the worst optimizer.

This is the point where I diverge from the report because I can't reproduce the distributed network part and there's some more things I want to investigate.

There's more to an optimizer than the converged error rate: who'd want to wait for ages just to be a few percentage points better? If a method is much faster, that time can be spent looking for better architectures, augmentations, ... In the end, all we care about is wall-clock time $t_w$ to a good result. This wall-clock time can be divided into two parts: the number of epochs $n_e$ to a good result and the time spend per epoch $t_e$, such that $t_w = n_e \cdot t_e + \epsilon$, where $\epsilon$ is more-or-less irrelevant.

From the previous plot (representing Fig.1 from the report), we can see that for this simple network on this simple MNIST dataset, all methods have roughly converged after $n_e = 30$ epochs. I have so far never seen $t_e$ reported in a paper, so let's have a look at that one next.

In [250]:

```
def plot_times(experiments):
N = len(experiments)
W = 0.8
margin = 1-W
plt.bar(np.arange(N), list(map(np.mean, experiments.values())), W, yerr=list(map(np.std, experiments.values())), color=mpl.rcParams['axes.color_cycle'])
plt.xticks(np.arange(N)+W/2, list(experiments.keys()))
plt.xlim(-margin, N-1+W+margin)
plt.ylabel("Time for 50 epochs [s]")
plot_times(OrderedDict((
("SGD", t_sgd),
("Momentum", np.concatenate((t_mom1, t_mom2, t_mom3))),
("Nesterov", np.concatenate((t_nes1, t_nes2, t_nes3))),
("AdaGrad", t_adag),
("RMSProp", np.concatenate((t_rms1, t_rms2, t_rms3))),
("AdaDelta", t_adad)))
)
```

There's two conclusions I see here:

- Obviously, the more complex the optimizer, the slower it gets.
- Plain momentum, if tuned properly, gives the most bang for the buck.

I also ran all of the momentum-based methods (except for AdaDelta) with the three typically used momentum "strenghts:" $0.9$, $0.95$ and $0.99$ ; the following plot compares their training curves:

In [255]:

```
fig, ax = plt.subplots(1,1, figsize=(16,8))
ax.plot(np.mean(errs_mom1, axis=0)*100, label='Momentum 0.9', color='#E24A33', ls='-')
ax.plot(np.mean(errs_mom2, axis=0)*100, label='Momentum 0.95', color='#348ABD', ls='-')
ax.plot(np.mean(errs_mom3, axis=0)*100, label='Momentum 0.99', color='#988ED5', ls='-')
ax.plot(np.mean(errs_nes1, axis=0)*100, label='Nesterov 0.9', color='#E24A33', ls='--')
ax.plot(np.mean(errs_nes2, axis=0)*100, label='Nesterov 0.95', color='#348ABD', ls='--')
ax.plot(np.mean(errs_nes3, axis=0)*100, label='Nesterov 0.99', color='#988ED5', ls='--')
ax.plot(np.mean(errs_rms1, axis=0)*100, label='RMSProp 0.9', color='#E24A33', ls=':')
ax.plot(np.mean(errs_rms2, axis=0)*100, label='RMSProp 0.95', color='#348ABD', ls=':')
ax.plot(np.mean(errs_rms3, axis=0)*100, label='RMSProp 0.99', color='#988ED5', ls=':')
ax.grid(True)
ax.set_ylim(1, 4)
ax.set_xlabel("Number of epochs")
ax.set_ylabel("Errors on testset [%]")
fatlegend(ax)
```

Out[255]:

The conclusions I draw for this plot, again restricted to *this small network on this dataset*, is that `RMSProp`

(the very lightly dotted lines) is pretty consistent across strength of momentum, while classical momentum and Nesterov momentum are consistent with eachother but vary widely across momentum strength.

For this analysis to really be complete, I'd like to add following parts when I get the time to:

- Analysis of the effective learning rates (Section 4.3 of the note). This should be relatively easy, since the accumulated quantities are readily available as
`sh_g2`

and`sh_delta2`

attributes of the optimizer object. - Analysis of robustness on other datasets. For example, I had to dramatically change the value of $\epsilon$ for AdaDelta to work at all on the NSDB Kaggle challenge dataset. This suggests less practical robustness than what the report sells.