In this notebook we'll demo how to use gpflow.training.monitor
for logging the optimisation of a GPflow model.
import itertools
import os
import numpy as np
import gpflow
import gpflow.training.monitor as mon
import numbers
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline
We first generate some random data and create a GPflow model.
Under the hood, GPflow gives each model a unique name containing a random identifier, which is used to name the Variables the model creates in the TensorFlow graph. This is useful in interactive sessions, where people may create several models, as it prevents variables with the same name from conflicting. However, when loading a model from a checkpoint, we need to make sure that the names of all the variables are exactly the same as in the checkpoint. This is why we pass name="SVGP" to the model constructor, and why we use gpflow.defer_build().
np.random.seed(0)
X = np.random.rand(10000, 1) * 10
Y = np.sin(X) + np.random.randn(*X.shape)
Xt = np.random.rand(10000, 1) * 10
Yt = np.sin(Xt) + np.random.randn(*Xt.shape)
with gpflow.defer_build():
    m = gpflow.models.SVGP(X, Y, gpflow.kernels.RBF(1), gpflow.likelihoods.Gaussian(),
                           Z=np.linspace(0, 10, 5)[:, None],
                           minibatch_size=100, name="SVGP")
    m.likelihood.variance = 0.01
m.compile()
Let's compute the log likelihood before the optimisation.
print('LML before the optimisation: %f' % m.compute_log_likelihood())
LML before the optimisation: -1271605.621944
We will be using a TensorFlow optimiser. All TensorFlow optimisers support a global_step
variable, whose purpose is to track how many optimisation steps have occurred. It is useful to keep this in a TensorFlow variable, as this allows it to be restored together with all the parameters of the model.
The code below creates this variable using a monitor's helper function. It is important to create it before building the monitor in case the monitor includes a checkpoint task. This is because the checkpoint internally uses the TensorFlow Saver which creates a list of variables to save. Therefore all variables expected to be saved by the checkpoint task should exist by the time the task is created.
session = m.enquire_session()
global_step = mon.create_global_step(session)
We also need to create our optimiser before building the monitor to make sure that it is restored correctly. Adam, for example, keeps track of certain gradient moments that have been accumulated over time. Momentum is another example of a state in the optimiser that may need to be restored. We also need to call minimize
to initialise all the variables in the optimiser. We run for zero iterations so no actual optimisation is done. This is a slight hack.
optimiser = gpflow.train.AdamOptimizer(0.01)
optimiser.minimize(m, maxiter=0)
Next we need to construct the monitor. gpflow.training.monitor
provides classes that are building blocks for the monitor. Essentially, a monitor is a function that is provided as a callback to an optimiser. It consists of a number of tasks that may be executed at each step, subject to their running conditions.
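To make the callback idea concrete, here is a minimal plain-Python sketch (not the gpflow API; `make_monitor` and the `(period, fn)` task pairs are illustrative inventions) of a step callback that runs tasks subject to periodic conditions, mimicking what PeriodicIterationCondition does:

```python
def make_monitor(tasks):
    """Build a step callback from (period, fn) pairs.

    Each fn(step) runs whenever the step number is a multiple
    of its period, similar to PeriodicIterationCondition.
    """
    def monitor(step, *args, **kwargs):
        for period, fn in tasks:
            if step % period == 0:
                fn(step)
    return monitor

# Example: record every 10th step out of 30
calls = []
callback = make_monitor([(10, calls.append)])
for step in range(1, 31):
    callback(step)
print(calls)  # [10, 20, 30]
```

The real Monitor adds per-task timing and exit conditions on top of this basic pattern, but the optimiser sees it the same way: a plain callable invoked once per step.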
In this example, we want to:

- print the timings of the optimisation steps
- sleep briefly at each step
- periodically save a checkpoint of the model
- periodically write the model parameters to TensorBoard
- periodically estimate the log marginal likelihood and write it to TensorBoard

We will define these tasks as follows:
print_task = mon.PrintTimingsTask().with_name('print')\
    .with_condition(mon.PeriodicIterationCondition(10))\
    .with_exit_condition(True)

sleep_task = mon.SleepTask(0.01).with_name('sleep')

saver_task = mon.CheckpointTask('./monitor-saves').with_name('saver')\
    .with_condition(mon.PeriodicIterationCondition(10))\
    .with_exit_condition(True)

file_writer = mon.LogdirWriter('./model-tensorboard')

model_tboard_task = mon.ModelToTensorBoardTask(file_writer, m).with_name('model_tboard')\
    .with_condition(mon.PeriodicIterationCondition(10))\
    .with_exit_condition(True)

lml_tboard_task = mon.LmlToTensorBoardTask(file_writer, m).with_name('lml_tboard')\
    .with_condition(mon.PeriodicIterationCondition(100))\
    .with_exit_condition(True)
As the above code shows, each task can be assigned a name and running conditions. The name will be shown in the task timing summary.
There are two different types of running conditions: with_condition
controls execution of the task at each iteration in the optimisation loop. with_exit_condition
is a simple boolean flag indicating that the task should also run at the end of optimisation.
In this example we want to run our tasks periodically, at every iteration or every 10th or 100th iteration.
Notice that the two TensorBoard tasks will write events into the same file. It is possible to share a file writer between multiple tasks. However, it is not possible to share the same event location between multiple file writers. An attempt to open two writers with the same location will result in an error.
We may also want to perform certain tasks that do not have pre-defined Task
classes. For example, we may want to compute the performance on a test set. Here we create such a class by extending BaseTensorBoardTask
to log the test metrics in addition to all the scalar parameters.
class CustomTensorBoardTask(mon.BaseTensorBoardTask):
    def __init__(self, file_writer, model, Xt, Yt):
        super().__init__(file_writer, model)
        self.Xt = Xt
        self.Yt = Yt
        self._full_test_err = tf.placeholder(gpflow.settings.float_type, shape=())
        self._full_test_nlpp = tf.placeholder(gpflow.settings.float_type, shape=())
        self._summary = tf.summary.merge([tf.summary.scalar("test_rmse", self._full_test_err),
                                          tf.summary.scalar("test_nlpp", self._full_test_nlpp)])

    def run(self, context: mon.MonitorContext, *args, **kwargs) -> None:
        minibatch_size = 100
        # Predict the test set in minibatches; -(-n // b) is ceiling division
        preds = np.vstack([self.model.predict_y(self.Xt[mb * minibatch_size:(mb + 1) * minibatch_size, :])[0]
                           for mb in range(-(-len(self.Xt) // minibatch_size))])
        test_err = np.mean((self.Yt - preds) ** 2.0) ** 0.5
        self._eval_summary(context, {self._full_test_err: test_err, self._full_test_nlpp: 0.0})
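The run() method above splits the test set into minibatches. The expression -(-len(Xt) // minibatch_size) is integer ceiling division: it gives the number of minibatches needed to cover the whole set, including a final partial one. A small self-contained sketch of the idiom (the helper name is ours, for illustration):

```python
def num_minibatches(n, minibatch_size):
    # -(-n // b) == ceil(n / b): floor division on the negation rounds up
    return -(-n // minibatch_size)

print(num_minibatches(10000, 100))  # 100
print(num_minibatches(101, 100))    # 2 (a final partial minibatch)
print(num_minibatches(100, 100))    # 1
```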
custom_tboard_task = CustomTensorBoardTask(file_writer, m, Xt, Yt).with_name('custom_tboard')\
    .with_condition(mon.PeriodicIterationCondition(100))\
    .with_exit_condition(True)
Now we can put all these tasks into a monitor.
monitor_tasks = [print_task, model_tboard_task, lml_tboard_task, custom_tboard_task, saver_task, sleep_task]
monitor = mon.Monitor(monitor_tasks, session, global_step)
We finally get to running the optimisation.
We may want to continue a previously run optimisation by restoring the TensorFlow graph from the latest checkpoint. Otherwise, skip this step.
if os.path.isdir('./monitor-saves'):
    mon.restore_session(session, './monitor-saves')
    print('LML after loading: %f' % m.compute_log_likelihood())
INFO:tensorflow:Restoring parameters from ./monitor-saves/cp-900
LML after loading: -32615.572401
To check that the model has been correctly restored, we print out a model hyperparameter (1 at initialisation) and an optimiser variable (zeros at initialisation).
print(m.kern.lengthscales.value)
print(session.run(optimiser.optimizer.variables()[0]))
1.0
[[ 401.49689376]
 [ 288.26263405]
 [ -51.83913957]
 [ 236.47567674]
 [-100.38292958]]
with mon.Monitor(monitor_tasks, session, global_step, print_summary=True) as monitor:
    optimiser.minimize(m, step_callback=monitor, maxiter=450, global_step=global_step)

file_writer.close()
Iteration 10   total itr.rate 11.40/s  recent itr.rate 11.40/s  opt.step 910   total opt.rate 12.72/s   recent opt.rate 12.72/s
Iteration 20   total itr.rate 16.96/s  recent itr.rate 33.13/s  opt.step 920   total opt.rate 25.02/s   recent opt.rate 780.35/s
Iteration 30   total itr.rate 21.50/s  recent itr.rate 46.32/s  opt.step 930   total opt.rate 36.29/s   recent opt.rate 366.01/s
Iteration 40   total itr.rate 25.17/s  recent itr.rate 51.48/s  opt.step 940   total opt.rate 47.29/s   recent opt.rate 519.86/s
Iteration 50   total itr.rate 27.82/s  recent itr.rate 48.07/s  opt.step 950   total opt.rate 57.95/s   recent opt.rate 587.19/s
Iteration 60   total itr.rate 29.92/s  recent itr.rate 48.13/s  opt.step 960   total opt.rate 68.14/s   recent opt.rate 563.82/s
Iteration 70   total itr.rate 31.63/s  recent itr.rate 48.06/s  opt.step 970   total opt.rate 77.43/s   recent opt.rate 426.77/s
Iteration 80   total itr.rate 33.01/s  recent itr.rate 47.65/s  opt.step 980   total opt.rate 86.89/s   recent opt.rate 599.23/s
Iteration 90   total itr.rate 34.22/s  recent itr.rate 48.28/s  opt.step 990   total opt.rate 96.12/s   recent opt.rate 637.83/s
Iteration 100  total itr.rate 35.33/s  recent itr.rate 50.04/s  opt.step 1000  total opt.rate 104.69/s  recent opt.rate 530.72/s
Computing full lml...
100%|██████████| 100/100 [00:00<00:00, 465.36it/s]
Iteration 110  total itr.rate 29.51/s  recent itr.rate 11.14/s  opt.step 1010  total opt.rate 112.54/s  recent opt.rate 450.99/s
Iteration 120  total itr.rate 30.43/s  recent itr.rate 46.38/s  opt.step 1020  total opt.rate 119.94/s  recent opt.rate 432.72/s
Iteration 130  total itr.rate 31.23/s  recent itr.rate 45.47/s  opt.step 1030  total opt.rate 127.31/s  recent opt.rate 484.54/s
Iteration 140  total itr.rate 31.95/s  recent itr.rate 45.64/s  opt.step 1040  total opt.rate 133.81/s  recent opt.rate 397.59/s
Iteration 150  total itr.rate 32.56/s  recent itr.rate 44.57/s  opt.step 1050  total opt.rate 140.46/s  recent opt.rate 462.15/s
Iteration 160  total itr.rate 32.96/s  recent itr.rate 40.48/s  opt.step 1060  total opt.rate 146.42/s  recent opt.rate 402.79/s
Iteration 170  total itr.rate 33.52/s  recent itr.rate 45.79/s  opt.step 1070  total opt.rate 152.74/s  recent opt.rate 493.06/s
Iteration 180  total itr.rate 34.10/s  recent itr.rate 48.55/s  opt.step 1080  total opt.rate 158.83/s  recent opt.rate 493.83/s
Iteration 190  total itr.rate 34.56/s  recent itr.rate 45.54/s  opt.step 1090  total opt.rate 164.66/s  recent opt.rate 484.71/s
100%|██████████| 100/100 [00:00<00:00, 600.72it/s]
Iteration 200  total itr.rate 35.00/s  recent itr.rate 46.11/s  opt.step 1100  total opt.rate 170.24/s  recent opt.rate 479.07/s
Computing full lml...
Iteration 210  total itr.rate 33.57/s  recent itr.rate 18.50/s  opt.step 1110  total opt.rate 176.20/s  recent opt.rate 587.12/s
Iteration 220  total itr.rate 33.98/s  recent itr.rate 45.70/s  opt.step 1120  total opt.rate 181.26/s  recent opt.rate 456.23/s
Iteration 230  total itr.rate 34.34/s  recent itr.rate 44.49/s  opt.step 1130  total opt.rate 186.52/s  recent opt.rate 516.53/s
Iteration 240  total itr.rate 34.74/s  recent itr.rate 47.68/s  opt.step 1140  total opt.rate 191.25/s  recent opt.rate 459.12/s
Iteration 250  total itr.rate 35.10/s  recent itr.rate 46.61/s  opt.step 1150  total opt.rate 197.06/s  recent opt.rate 727.69/s
Iteration 260  total itr.rate 35.11/s  recent itr.rate 35.50/s  opt.step 1160  total opt.rate 201.54/s  recent opt.rate 466.21/s
Iteration 270  total itr.rate 35.45/s  recent itr.rate 47.03/s  opt.step 1170  total opt.rate 205.52/s  recent opt.rate 422.49/s
Iteration 280  total itr.rate 35.76/s  recent itr.rate 47.23/s  opt.step 1180  total opt.rate 209.48/s  recent opt.rate 436.70/s
Iteration 290  total itr.rate 36.03/s  recent itr.rate 45.63/s  opt.step 1190  total opt.rate 213.75/s  recent opt.rate 498.71/s
100%|██████████| 100/100 [00:00<00:00, 622.80it/s]
Iteration 300  total itr.rate 36.36/s  recent itr.rate 49.47/s  opt.step 1200  total opt.rate 218.43/s  recent opt.rate 596.23/s
Computing full lml...
Iteration 310  total itr.rate 35.28/s  recent itr.rate 18.66/s  opt.step 1210  total opt.rate 222.25/s  recent opt.rate 468.74/s
Iteration 320  total itr.rate 35.54/s  recent itr.rate 46.01/s  opt.step 1220  total opt.rate 225.08/s  recent opt.rate 371.33/s
Iteration 330  total itr.rate 35.76/s  recent itr.rate 44.41/s  opt.step 1230  total opt.rate 229.69/s  recent opt.rate 667.65/s
Iteration 340  total itr.rate 35.88/s  recent itr.rate 40.34/s  opt.step 1240  total opt.rate 232.81/s  recent opt.rate 421.93/s
Iteration 350  total itr.rate 36.08/s  recent itr.rate 44.39/s  opt.step 1250  total opt.rate 236.52/s  recent opt.rate 516.11/s
Iteration 360  total itr.rate 36.31/s  recent itr.rate 47.22/s  opt.step 1260  total opt.rate 240.28/s  recent opt.rate 540.74/s
Iteration 370  total itr.rate 36.52/s  recent itr.rate 46.15/s  opt.step 1270  total opt.rate 244.51/s  recent opt.rate 670.20/s
Iteration 380  total itr.rate 36.72/s  recent itr.rate 45.60/s  opt.step 1280  total opt.rate 247.75/s  recent opt.rate 485.35/s
Iteration 390  total itr.rate 36.91/s  recent itr.rate 46.12/s  opt.step 1290  total opt.rate 251.01/s  recent opt.rate 501.65/s
100%|██████████| 100/100 [00:00<00:00, 580.38it/s]
Iteration 400  total itr.rate 37.09/s  recent itr.rate 45.67/s  opt.step 1300  total opt.rate 254.89/s  recent opt.rate 641.85/s
Computing full lml...
Iteration 410  total itr.rate 36.22/s  recent itr.rate 18.68/s  opt.step 1310  total opt.rate 257.04/s  recent opt.rate 387.75/s
Iteration 420  total itr.rate 36.45/s  recent itr.rate 49.45/s  opt.step 1320  total opt.rate 260.32/s  recent opt.rate 546.02/s
Iteration 430  total itr.rate 36.64/s  recent itr.rate 47.14/s  opt.step 1330  total opt.rate 263.74/s  recent opt.rate 588.43/s
Iteration 440  total itr.rate 36.82/s  recent itr.rate 46.54/s  opt.step 1340  total opt.rate 265.82/s  recent opt.rate 403.08/s
0%| | 0/100 [00:00<?, ?it/s]
Iteration 450  total itr.rate 36.98/s  recent itr.rate 45.63/s  opt.step 1350  total opt.rate 267.56/s  recent opt.rate 375.17/s
Iteration 450  total itr.rate 36.60/s  recent itr.rate 0.00/s   opt.step 1350  total opt.rate 265.95/s  recent opt.rate 0.00/s
Computing full lml...
100%|██████████| 100/100 [00:00<00:00, 572.62it/s]
Tasks execution time summary:
print:         0.0136 (sec)
model_tboard:  0.1235 (sec)
lml_tboard:    0.9178 (sec)
custom_tboard: 1.0409 (sec)
saver:         4.3582 (sec)
sleep:         4.5475 (sec)
Now let's compute the log likelihood again. Hopefully we will see an increase in its value.
print('LML after the optimisation: %f' % m.compute_log_likelihood())
print('Global step : %i' % session.run(global_step))
LML after the optimisation: -24384.018335
Global step : 1350
In this example we have used the TensorFlow AdamOptimizer. Using ScipyOptimizer
requires a couple of special tricks. Firstly, this optimiser works with its own copy of the trained variables and updates the original ones only when the optimisation is completed. Secondly, it doesn't use the global_step
variable. This can present a problem when doing optimisation in several stages: the monitor has to use an iteration count instead of global_step, and this count is reset to zero at each stage.
To address the first problem we provide the optimiser as one of the parameters to the monitor. The monitor will make sure the original variables are updated whenever we access them from a monitoring task. The second problem is addressed by creating an instance of MonitorContext
and providing it explicitly to the Monitor.
optimiser = gpflow.train.ScipyOptimizer()
context = mon.MonitorContext()
with mon.Monitor([print_task], session, print_summary=True, optimiser=optimiser, context=context) as monitor:
    optimiser.minimize(m, step_callback=monitor, maxiter=250)
INFO:tensorflow:Optimization terminated with:
  Message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
  Objective function value: 13980.047398
  Number of iterations: 4
  Number of functions evaluations: 11
Iteration 4  total itr.rate 5.75/s  recent itr.rate nan/s  opt.step 4  total opt.rate 5.75/s  recent opt.rate nan/s
Tasks execution time summary:
print: 0.0138 (sec)