# Bayesian Optimization with Gaussian Processes

## DataPhilly

### Principal Data Scientist, Monetate

In [8]:
display(gif)


## The problem

Minimize a function that is difficult, expensive, or time-consuming to evaluate.

### The idea

• Represent our uncertainty about the true function with a probabilistic model
• Choose the next evaluation point that maximizes the chances of finding the function's minimum

## Gaussian processes

A flexible family of probability distributions over continuous functions

In [12]:
fig

Out[12]:

$f \sim GP(m, k)$ if, for any finite collection of inputs $\mathbf{x} = (x_1, \ldots, x_n)$, the vector $(f(x_1), \ldots, f(x_n))$ has a multivariate normal distribution:

$$(f(x_1), f(x_2), \ldots, f(x_n)) \sim N\left(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x})\right)$$

$$m(\mathbf{x}) = \begin{pmatrix} m(x_1) \\ m(x_2) \\ \vdots \\ m(x_n) \end{pmatrix},\ k(\mathbf{x}, \mathbf{x}) = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix}$$
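
The definition above suggests a direct way to draw sample functions: pick a grid of inputs, build $k(\mathbf{x}, \mathbf{x})$, and sample from the resulting multivariate normal. The following is a minimal sketch rather than the code behind the figures. It assumes a zero mean function and the squared exponential kernel introduced later in this talk; the names `sq_exp_kernel` and `sample_gp_prior`, the grid, and the `jitter` term are illustrative choices.

```python
import numpy as np

def sq_exp_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared exponential covariance matrix between two 1-D input arrays."""
    r = np.abs(x1[:, None] - x2[None, :])
    return sigma**2 * np.exp(-r**2 / (2 * ell**2))

def sample_gp_prior(x, n_samples=3, jitter=1e-6, seed=0):
    """Draw n_samples vectors (f(x_1), ..., f(x_n)) from GP(0, k)."""
    rng = np.random.default_rng(seed)
    # k(x, x); the small diagonal jitter (an assumption) only stabilizes
    # the Cholesky factorization of a nearly singular matrix.
    cov = sq_exp_kernel(x, x) + jitter * np.eye(len(x))
    chol = np.linalg.cholesky(cov)
    # If z ~ N(0, I), then chol @ z ~ N(0, cov).
    z = rng.standard_normal((len(x), n_samples))
    return (chol @ z).T

x = np.linspace(0, 5, 50)
samples = sample_gp_prior(x)  # one row per sampled function
```

Each row of `samples` is one function drawn from the prior, evaluated on the grid `x`; plotting the rows against `x` gives curves of the kind a GP prior figure typically shows.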

If $f \sim GP(0, k)$ and we observe $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $y_i = f(x_i) + \varepsilon_i$ with independent noise $\varepsilon_i \sim N(0, \sigma^2)$, then $f\ |\ \mathcal{D}$ is also a Gaussian process with

\begin{align*} f(\mathbf{x^*})\ |\ \mathcal{D} & \sim N(\tilde{m}(\mathbf{x^*}), \tilde{k}(\mathbf{x^*}, \mathbf{x^*})) \end{align*}

\begin{align*} \tilde{m}(\mathbf{x^*}) & = k(\mathbf{x^*}, \mathbf{x}) \left(k(\mathbf{x}, \mathbf{x}) + \sigma^2 I\right)^{-1} \mathbf{y} \\ \tilde{k}(\mathbf{x^*}, \mathbf{x^*}) & = k(\mathbf{x^*}, \mathbf{x^*}) - k(\mathbf{x^*}, \mathbf{x}) \left(k(\mathbf{x}, \mathbf{x}) + \sigma^2 I\right)^{-1} k(\mathbf{x}, \mathbf{x^*}) \end{align*}
In [15]:
fig

Out[15]:
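
The posterior mean and covariance formulas translate almost line for line into NumPy. The sketch below is an illustration under stated assumptions, not the code that produced the figure: it assumes a zero prior mean and a squared exponential kernel, and the names `gp_posterior` and `noise_var` (standing in for the observation noise $\sigma^2$) as well as the toy data are hypothetical.

```python
import numpy as np

def sq_exp_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared exponential covariance matrix between two 1-D input arrays."""
    r = np.abs(x1[:, None] - x2[None, :])
    return sigma**2 * np.exp(-r**2 / (2 * ell**2))

def gp_posterior(x_star, x, y, noise_var=1e-2, kernel=sq_exp_kernel):
    """Posterior mean and covariance of f(x_star) given noisy data (x, y)."""
    K = kernel(x, x) + noise_var * np.eye(len(x))  # k(x, x) + sigma^2 I
    K_s = kernel(x_star, x)                        # k(x*, x)
    K_ss = kernel(x_star, x_star)                  # k(x*, x*)
    # Solve linear systems instead of forming the inverse explicitly.
    mean = K_s @ np.linalg.solve(K, y)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov

# Toy observations (an assumption made purely for illustration).
x = np.array([0.5, 1.5, 3.0, 4.0])
y = np.sin(x)
x_star = np.linspace(0, 5, 100)

mean, cov = gp_posterior(x_star, x, y)
# Pointwise posterior standard deviation; clip guards against tiny
# negative values caused by floating-point round-off.
std = np.sqrt(np.clip(np.diag(cov), 0, None))
```

Plotting `mean` with a band of `mean ± 2 * std` gives the usual picture of a GP fit: the band narrows near the observed points and widens away from them.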

### Covariance kernels

#### Squared exponential kernel

\begin{align*} k(x_1, x_2) & = \sigma^2 \exp\left(-\frac{r^2}{2 \ell^2}\right) \\ r & = |x_1 - x_2| \end{align*}
• Resulting functions are infinitely (mean square) differentiable
• $\sigma$ controls the magnitude of fluctuations
• $\ell$ controls how quickly the function fluctuates
In [17]:
fig

Out[17]:
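
To make the roles of $\sigma$ and $\ell$ concrete, the squared exponential kernel can be evaluated for two inputs one unit apart under several length scales. This is a small sketch; the function name `sq_exp_kernel` and the chosen parameter values are assumptions made for illustration.

```python
import numpy as np

def sq_exp_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared exponential covariance matrix between two 1-D input arrays."""
    r = np.abs(np.asarray(x1)[:, None] - np.asarray(x2)[None, :])
    return sigma**2 * np.exp(-r**2 / (2 * ell**2))

x = np.array([0.0, 1.0])  # two inputs at distance r = 1
for ell in (0.1, 1.0, 10.0):
    K = sq_exp_kernel(x, x, sigma=2.0, ell=ell)
    variance = K[0, 0]               # equals sigma^2, independent of ell
    correlation = K[0, 1] / K[0, 0]  # equals exp(-r^2 / (2 ell^2))
    print(f"ell={ell}: variance={variance:.1f}, correlation={correlation:.3f}")
```

With $\sigma = 2$ the variance is 4 at every length scale, while the correlation between the two points goes from essentially zero at $\ell = 0.1$ to about 0.607 at $\ell = 1$ and about 0.995 at $\ell = 10$: a larger $\ell$ ties nearby function values together more tightly, so sampled functions vary more slowly.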