Gaussian process regression is one of the techniques used by Bayesian optimisation to find the best hyperparameters of a machine learning algorithms and other optimisation problems with very expensive cost functions.
Rasmussen and Williams have written a very good introduction to Gaussian processes in general.
This blog post is about the absolute basics of understanding gaussian processes and how to use them via a nice python package called
The problem we would like to solve is: how do I model a complicated function if evaluating it is very expensive? For example:
The technical term for this is gaussian process regression. Let's get going!
import numpy as np import matplotlib.pyplot as plt import seaborn as sns import george from george.kernels import ExpSquaredKernel
For an example we always have a slight advantage as we have to choose the function that we are trying to discover. The reason we are at an advantage is that normally we do not know this function.
Let's choose $y\left( x \right) = -x \sin(x)$ in the range $0 < x < 10$.
xsinx = lambda x: -x*np.sin(x) t = np.linspace(0, 10, 500) plt.plot(t, xsinx(t)) plt.ylabel("y") plt.xlabel("x")
<matplotlib.text.Text at 0x10ebb5290>
Let's sample ten points at random $x$ values in one go (even if we won't use them all at one). We also add some gaussian noise to each point.
# Generate some fake, noisy data x = 10 * np.sort(np.random.rand(10)) yerr = 0.2 * np.ones_like(x) y = xsinx(x) + yerr * np.random.randn(len(x))
The main ingredient of a gaussian process is the kernel. It describes how correlated each point is with every other.
# Set up a Gaussian process kernel = ExpSquaredKernel(1.0) gp = george.GP(kernel)
What kind of functions to you get from this gaussian process before you give it any data? You can sample from them using the
sample() method. Below ten examples.
def label_and_limits(axis): ax.set_ylim(-9,7) ax.set_xlabel("x") ax.set_ylabel("y") fig, ax = plt.subplots() for pred in gp.sample(t, 10): ax.plot(t, pred) label_and_limits(ax)
If we choose a different kernel, the functions we sample will end up looking different.
from george.kernels import Matern32Kernel matern_kernel = Matern32Kernel(1.0) matern_gp = george.GP(matern_kernel) fig, ax = plt.subplots() for pred in matern_gp.sample(t, 10): ax.plot(t, pred) label_and_limits(ax)
So far so boring. Let's add the first four our data points to the gaussian process and see what kind of functions we now sample. By adding points we narrow down the space from which we can sample functions. As you can see all the sampled functions go close to the data points we added.
Nmax = 5 gp.compute(x[:Nmax], yerr[:Nmax]) fig, ax = plt.subplots() for pred in gp.sample_conditional(y[:Nmax], t, 10): ax.plot(t, pred) ax.errorbar(x[:Nmax], y[:Nmax], yerr=yerr[:Nmax], fmt='ok', zorder=3) label_and_limits(ax)
If we add one more point the range of possible functions, the range narrows even more:
Nmax = 6 gp.compute(x[:Nmax], yerr[:Nmax]) fig, ax = plt.subplots() for pred in gp.sample_conditional(y[:Nmax], t, 10): ax.plot(t, pred) ax.errorbar(x[:Nmax], y[:Nmax], yerr=yerr[:Nmax], fmt='ok', zorder=3) label_and_limits(ax)
Adding all our points reduces the space of possible functions even further. Here we plot the median predicted value and the one sigma band around it. You could get a similar effect by sampling several functions and plotting them with some transparency. In green the original function from which we sampled. Does not look to bad does it?
gp.compute(x, yerr) mu, cov = gp.predict(y, t) std = np.sqrt(np.diag(cov)) fig, ax = plt.subplots() ax.plot(t, mu, label="GP median") ax.plot(t, xsinx(t), label="x sin(x)") ax.fill_between(t, mu-std, mu+std, alpha=0.5) # explicit zorder to draw points and errorbars on top of everything ax.errorbar(x, y, yerr=yerr, fmt='ok', zorder=3, label="samples") ax.legend(loc='best')
<matplotlib.legend.Legend at 0x111be1dd0>
To bring it back to optimising a cost function while sampling it only a few times. A gaussian process provides you with its best guess and an uncertainty for every point of your function. At the beginning the guess is not very good, it repeats back at you what you put in as the prior for the possible functions. As you add more data however you get a better and better estimate. All without having to make any assumptions about the shape of the function you are looking for.
This is interesting for expensive functions as you can use your current knowledge to estimate where the minimum/maximum is, sample a further point at that location, and so focus your expensive sampling on locations which are promising in terms of being the best point.
If you find a mistake or want to tell me something else get in touch on twitter @betatim