Bayesian Optimization with Gaussian Processes

February 18, 2016

DataPhilly

@AustinRochford

Principal Data Scientist, Monetate

In [8]:
display(gif)

The problem

Minimize a function that is difficult, expensive, or time-consuming to evaluate.

The idea

  • Represent our uncertainty about the true function with a probabilistic model
  • Choose the next evaluation point that maximizes the chances of finding the function's minimum

Gaussian processes

A flexible family of probability distributions over continuous functions

In [12]:
fig
Out[12]:

$f \sim GP(m, k)$ if for any $\mathbf{x} = (x_1, \ldots, x_n)$, $(f(x_1), \ldots, f(x_n))$ has a multivariate normal distribution


$$ (f(x_1), f(x_2), \ldots, f(x_n)) \sim N\left(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x})\right) $$


$$ m(\mathbf{x}) = \begin{pmatrix} m(x_1) \\ m(x_2) \\ \vdots \\ m(x_n) \end{pmatrix},\ k(\mathbf{x}, \mathbf{x}) = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix} $$

If $f \sim GP(0, k)$ and $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $y_i = f(x_i) + \varepsilon_i$ with independent noise $\varepsilon_i \sim N(0, \sigma^2)$, then $f\ |\ \mathcal{D}$ is also a Gaussian process, with


$$ \begin{align*} f(\mathbf{x^*})\ |\ \mathcal{D} & \sim N(\tilde{m}(\mathbf{x^*}), \tilde{k}(\mathbf{x^*}, \mathbf{x^*})) \end{align*} $$


$$ \begin{align*} \tilde{m}(\mathbf{x^*}) & = k(\mathbf{x^*}, \mathbf{x}) \left(k(\mathbf{x}, \mathbf{x}) + \sigma^2 I\right)^{-1} \mathbf{y} \\ \tilde{k}(\mathbf{x^*}, \mathbf{x^*}) & = k(\mathbf{x^*}, \mathbf{x^*}) - k(\mathbf{x^*}, \mathbf{x}) \left(k(\mathbf{x}, \mathbf{x}) + \sigma^2 I\right)^{-1} k(\mathbf{x}, \mathbf{x^*}) \end{align*} $$
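
The posterior formulas above translate almost line for line into NumPy. The following is a minimal sketch, not the code behind the figures in this talk: the function names `sq_exp_kernel` and `gp_posterior`, the toy data drawn from $\sin(x)$, and the choice of a squared exponential kernel are all assumptions made for illustration, and any valid covariance kernel could be substituted.

```python
import numpy as np


def sq_exp_kernel(xa, xb, amplitude=1.0, ell=1.0):
    """Kernel matrix between two sets of 1-D points (illustrative kernel choice)."""
    r = np.abs(xa[:, None] - xb[None, :])
    return amplitude**2 * np.exp(-r**2 / (2 * ell**2))


def gp_posterior(x, y, x_star, kernel, noise_var):
    """Posterior mean and covariance of f(x_star) given noisy observations (x, y)."""
    K = kernel(x, x) + noise_var * np.eye(x.size)  # k(x, x) + sigma^2 I
    K_s = kernel(x_star, x)                        # k(x*, x)
    K_ss = kernel(x_star, x_star)                  # k(x*, x*)
    # Solve linear systems instead of forming the explicit inverse.
    mean = K_s @ np.linalg.solve(K, y)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov


# Hypothetical toy data: noisy observations of sin(x).
rng = np.random.default_rng(0)
noise_var = 0.01
x = np.linspace(0.0, 5.0, 8)
y = np.sin(x) + rng.normal(scale=np.sqrt(noise_var), size=x.size)

x_star = np.linspace(0.0, 5.0, 50)
post_mean, post_cov = gp_posterior(x, y, x_star, sq_exp_kernel, noise_var)

# Pointwise posterior standard deviation (clipped to guard against round-off).
post_sd = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))
```

Plotting `post_mean` with a band of a few `post_sd` on either side gives the usual picture of a GP fit: the band narrows near observed points and widens away from them.
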
In [15]:
fig
Out[15]:

Covariance kernels

Squared exponential kernel

$$ \begin{align*} k(x_1, x_2) & = \sigma^2 \exp\left(-\frac{r^2}{2 \ell^2}\right) \\ r & = |x_1 - x_2| \end{align*} $$
  • Resulting functions are infinitely (mean square) differentiable
  • $\sigma$ controls the magnitude of fluctuations (this is the kernel amplitude, not the observation noise $\sigma^2$ used earlier)
  • $\ell$ controls how quickly the function fluctuates
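
To make the roles of $\sigma$ and $\ell$ concrete, here is a small sketch that evaluates the kernel directly and draws a few functions from the prior $GP(0, k)$ on a grid. The helper names `sq_exp_kernel` and `sample_prior`, the grid, the specific parameter values, and the small `jitter` term added for numerical stability are all assumptions for illustration.

```python
import numpy as np


def sq_exp_kernel(xa, xb, sigma=1.0, ell=1.0):
    """Squared exponential kernel matrix between two sets of 1-D points."""
    r = np.abs(xa[:, None] - xb[None, :])
    return sigma**2 * np.exp(-r**2 / (2 * ell**2))


def sample_prior(x, sigma, ell, n_samples, rng, jitter=1e-8):
    """Draw functions from GP(0, k), evaluated on the grid x."""
    # The jitter keeps the covariance matrix numerically positive definite.
    K = sq_exp_kernel(x, x, sigma, ell) + jitter * np.eye(x.size)
    return rng.multivariate_normal(np.zeros(x.size), K, size=n_samples)


x = np.linspace(0.0, 5.0, 100)
rng = np.random.default_rng(0)

# Larger sigma gives larger swings; smaller ell gives wigglier functions.
wide = sample_prior(x, sigma=2.0, ell=1.0, n_samples=3, rng=rng)
wiggly = sample_prior(x, sigma=1.0, ell=0.2, n_samples=3, rng=rng)

# The prior variance at any single point is sigma^2, and the correlation
# between two points one unit apart is exp(-1 / (2 ell^2)).
zero, one = np.array([0.0]), np.array([1.0])
var_at_point = sq_exp_kernel(zero, zero, sigma=2.0, ell=1.0)[0, 0]  # 4.0
corr_short = sq_exp_kernel(zero, one, sigma=1.0, ell=0.5)[0, 0]     # exp(-2), about 0.135
corr_long = sq_exp_kernel(zero, one, sigma=1.0, ell=2.0)[0, 0]      # exp(-1/8), about 0.882
```

With a short length scale, function values one unit apart are nearly uncorrelated, so samples can change direction quickly; with a long length scale they stay strongly correlated and samples vary slowly.
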
In [17]:
fig
Out[17]: