Use of the term "non-parametric" in the context of Bayesian analysis is something of a misnomer. This is because the first and fundamental step in Bayesian modeling is to specify a *full probability model* for the problem at hand. It is rather difficult to explicitly state a full probability model without the use of probability functions, which are parametric. Bayesian non-parametric methods do not imply that there are no parameters, but rather that the number of parameters grows with the size of the dataset. In fact, Bayesian non-parametric models are *infinitely* parametric.

In [1]:

```
%matplotlib inline
import numpy as np
import scipy as sp
import pandas as pd
import pymc3 as pm
import seaborn as sns
import theano
import theano.tensor as tt
import matplotlib.pylab as plt
import matplotlib.cm as cmap
sns.set_context('notebook')
import warnings
warnings.filterwarnings("ignore", module="theano")
warnings.filterwarnings("ignore", module="mkl_fft")
warnings.filterwarnings("ignore", module="matplotlib")
np.random.seed(42)
```

What if we chose to use Gaussian distributions to model our data?

$$p(x \mid \pi, \Sigma) = (2\pi)^{-k/2}|\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (x-\mu)^{\prime}\Sigma^{-1}(x-\mu) \right\}$$There would not seem to be an advantage to doing this, because normal distributions are not particularly flexible distributions in and of themselves. However, adopting a set of Gaussians (a multivariate normal vector) confers a number of advantages. First, the marginal distribution of any subset of elements from a multivariate normal distribution is also normal:

$$p(x,y) = \mathcal{N}\left(\left[{ \begin{array}{c} {\mu_x} \\ {\mu_y} \\ \end{array} }\right], \left[{ \begin{array}{c} {\Sigma_x} & {\Sigma_{xy}} \\ {\Sigma_{xy}^T} & {\Sigma_y} \\ \end{array} }\right]\right)$$$$p(x) = \int p(x,y) dy = \mathcal{N}(\mu_x, \Sigma_x)$$Also, conditionals distributions of a subset of a multivariate normal distribution (conditional on the remaining elements) are normal too:

$$p(x|y) = \mathcal{N}(\mu_x + \Sigma_{xy}\Sigma_y^{-1}(y-\mu_y), \Sigma_x-\Sigma_{xy}\Sigma_y^{-1}\Sigma_{xy}^T)$$A Gaussian process generalizes the multivariate normal to infinite dimension. It is defined as an infinite collection of random variables, any finite subset of which have a Gaussian distribution. Thus, the marginalization property is explicit in its definition. Another way of thinking about an infinite vector is as a *function*. When we write a function that takes continuous values as inputs, we are essentially specifying an infinte vector that only returns values (indexed by the inputs) when the function is called upon to do so. By the same token, this notion of an infinite-dimensional Gaussian as a function allows us to work with them computationally: we are never required to store all the elements of the Gaussian process, only to calculate them on demand.

So, we can describe a Gaussian process as a ** disribution over functions**. Just as a multivariate normal distribution is completely specified by a mean vector and covariance matrix, a GP is fully specified by a

It is the marginalization property that makes working with a Gaussian process feasible: we can marginalize over the infinitely-many variables that we are not interested in, or have not observed.

For example, one specification of a GP might be as follows:

$$\begin{aligned} m(x) &=0 \\ k(x,x^{\prime}) &= \theta_1\exp\left(-\frac{\theta_2}{2}(x-x^{\prime})^2\right) \end{aligned}$$here, the covariance function is a **squared exponential**, for which values of $x$ and $x^{\prime}$ that are close together result in values of $k$ closer to 1 and those that are far apart return values closer to zero.

In [2]:

```
def exponential_cov(x, y, params):
return params[0] * np.exp( -0.5 * params[1] * np.subtract.outer(x, y)**2)
```

In [3]:

```
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,5))
xrange = np.linspace(0, 5)
ax1.plot(xrange, exponential_cov(0, xrange, [1, 1]))
ax1.set_xlabel('x')
ax1.set_ylabel('cov(0, x)')
z = np.array([exponential_cov(xrange, xprime, [1, 1]) for xprime in xrange])
ax2.imshow(z, cmap="inferno",
interpolation='none',
extent=(0, 5, 5, 0))
ax2.set_xlabel('x')
ax2.set_ylabel('x')
plt.tight_layout();
```

It may seem odd to simply adopt the zero function to represent the mean function of the Gaussian process -- surely we can do better than that! It turns out that most of the learning in the GP involves the covariance function and its parameters, so very little is gained in specifying a complicated mean function.

For a finite number of points, the GP becomes a multivariate normal, with the mean and covariance as the mean functon and covariance function evaluated at those points.

To make this notion of a "distribution over functions" more concrete, let's quickly demonstrate how we obtain realizations from a Gaussian process, which result in an evaluation of a function over a set of points. All we will do here is sample from the *prior* Gaussian process, so before any data have been introduced. What we need first is our covariance function, which will be the squared exponential, and a function to evaluate the covariance at given points (resulting in a covariance matrix).

We are going generate realizations sequentially, point by point, using the lovely conditioning property of mutlivariate Gaussian distributions. Here is that conditional:

$$p(y^*| x^*, y, x) = \mathcal{N}(\Sigma_{x^*x}\Sigma_y^{-1}y,\> \Sigma_{x^*}-\Sigma_{x^*x}\Sigma_y^{-1}\Sigma_{x^*x}^T)$$And this the function that implements it:

In [4]:

```
def conditional(x_new, x, y, params):
B = exponential_cov(x_new, x, params)
C = exponential_cov(x, x, params)
A = exponential_cov(x_new, x_new, params)
mu = np.linalg.inv(C).dot(B.T).T.dot(y)
sigma = A - B.dot(np.linalg.inv(C).dot(B.T))
return(mu.squeeze(), sigma.squeeze())
```

We will start with a Gaussian process prior with hyperparameters $\theta_0=1, \theta_1=10$. We will also assume a zero function as the mean, so we can plot a band that represents one standard deviation from the mean.

In [5]:

```
θ = [1, 10]
σ_0 = exponential_cov(0, 0, θ)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=σ_0, capsize=0)
plt.ylim(-3, 3);
```

Let's select an arbitrary starting point to sample, say $x=1$. Since there are no prevous points, we can sample from an unconditional Gaussian:

In [6]:

```
x = [1.]
y = [np.random.normal(scale=σ_0)]
y
```

Out[6]:

We can now update our confidence band, given the point that we just sampled, using the covariance function to generate new point-wise intervals, conditional on the value $[x_0, y_0]$.

In [7]:

```
σ_1 = exponential_cov(x, x, θ)
```

In [8]:

```
def predict(x, data, kernel, params, sigma, t):
k = [kernel(x, y, params) for y in data]
Sinv = np.linalg.inv(sigma)
y_pred = np.dot(k, Sinv).dot(t)
sigma_new = kernel(x, x, params) - np.dot(k, Sinv).dot(k)
return y_pred, sigma_new
```

In [9]:

```
x_pred = np.linspace(-3, 3, 1000)
predictions = [predict(i, x, exponential_cov, θ, σ_1, y) for i in x_pred]
```

In [10]:

```
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
plt.xlim(-3, 3); plt.ylim(-3, 3);
```

So conditional on this point, and the covariance structure we have specified, we have essentially constrained the probable location of additional points. Let's now sample another:

In [11]:

```
m, s = conditional([-0.7], x, y, θ)
y2 = np.random.normal(m, s)
y2
```

Out[11]:

This point is added to the realization, and can be used to further update the location of the next point.

In [12]:

```
x.append(-0.7)
y.append(y2)
```

In [13]:

```
σ_2 = exponential_cov(x, x, θ)
predictions = [predict(i, x, exponential_cov, θ, σ_2, y) for i in x_pred]
```

In [14]:

```
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
plt.xlim(-3, 3); plt.ylim(-3, 3);
```

Of course, sampling sequentially is just a heuristic to demonstrate how the covariance structure works. We can just as easily sample several points at once:

In [15]:

```
x_more = [-2.1, -1.5, 0.3, 1.8, 2.5]
mu, s = conditional(x_more, x, y, θ)
y_more = np.random.multivariate_normal(mu, s)
y_more
```

Out[15]:

In [16]:

```
x += x_more
y += y_more.tolist()
σ_new = exponential_cov(x, x, θ)
predictions = [predict(i, x, exponential_cov, θ, σ_new, y) for i in x_pred]
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0)
plt.plot(x, y, "ro")
plt.ylim(-3, 3);
```

So as the density of points becomes high, the result will be one realization (function) from the prior GP.

This example, of course, is trivial because it is simply a random function drawn from the prior. What we are really interested in is *learning* about an underlying function from information residing in our data. In a parametric setting, we either specify a likelihood, which we then maximize with respect to the parameters, of a full probability model, for which we calculate the posterior in a Bayesian context. Though the integrals associated with posterior distributions are typically intractable for parametric models, they do not pose a problem with Gaussian processes.

The following simulated data clearly shows some type of non-linear process, corrupted by a certain amount of observation or measurement error so it should be a reasonable task for a Gaussian process approach.

In [17]:

```
# set the seed
np.random.seed(1)
n = 100 # The number of data points
x = np.linspace(0, 10, n)
X = x[:, None] # The inputs to the GP, they must be arranged as a column vector
# Define the true covariance function and its parameters
ℓ_true = 1.0
η_true = 3.0
cov_func = η_true**2 * pm.gp.cov.Matern52(1, ℓ_true)
# A mean function that is zero everywhere
mean_func = pm.gp.mean.Zero()
# The latent function values are one sample from a multivariate normal
# Note that we have to call `eval()` because PyMC3 built on top of Theano
f_true = np.random.multivariate_normal(mean_func(X).eval(),
cov_func(X).eval() + 1e-8*np.eye(n), 1).flatten()
# The observed data is the latent function plus a small amount of IID Gaussian noise
# The standard deviation of the noise is `sigma`
σ_true = 2.0
y = f_true + σ_true * np.random.randn(n)
## Plot the data and the unobserved latent function
fig = plt.figure(figsize=(12,5)); ax = fig.gca()
ax.plot(X, f_true, "dodgerblue", lw=3, label="True f");
ax.plot(X, y, 'ok', ms=3, alpha=0.5, label="Data");
ax.set_xlabel("X"); ax.set_ylabel("y"); plt.legend();
```

The `gp.Marginal`

class in PyMC3 implements the simplest case of GP regression: the observed data are the sum of a GP and Gaussian noise. `gp.Marginal`

has a `marginal_likelihood`

method, a `conditional`

method, and a `predict`

method. Given a mean and covariance function, the function $f(x)$ is modeled as,

The observations $y$ are the unknown function plus noise

$$ \begin{aligned} \epsilon &\sim N(0, \Sigma) \\ y &= f(x) + \epsilon \\ \end{aligned} $$The marginal likelihood is the normalizing constant for the posterior distribution, and is the integral of the product of the likelihood and prior.

$$p(y|X) = \int_f p(y|f,X)p(f|X) df$$where for Gaussian processes, we are marginalizing over function values $f$ (instead of parameters $\theta$).

**GP prior**:

**Gaussian likelihood**:

**Marginal likelihood**:

Notice that the marginal likelihood includes both a data fit term $- \frac{1}{2}y^T(K+\sigma^2I)^{-1}y$ and a parameter penalty term $\frac{1}{2}\log|K + \sigma^2I|$. Hence, the marginal likelihood can help us select an appropriate covariance function, based on its fit to the dataset at hand.

This is relevant because we have to make choices regarding the parameters of our Gaussian process; they were chosen arbitrarily for the random functions we demonstrated above.

For example, in the squared exponential covariance function, we must choose two parameters:

$$k(x,x^{\prime}) = \theta_1\exp\left(-\frac{\theta_2}{2}(x-x^{\prime})^2\right)$$The first parameter $\theta_1$ is a scale parameter, which allows the function to yield values outside of the unit interval. The second parameter $\theta_2$ is a length scale parameter that determines the degree of covariance between $x$ and $x^{\prime}$; smaller values will tend to smooth the function relative to larger values.

We can use the **marginal likelihood** to select appropriate values for these parameters, since it trades off model fit with model complexity. Thus, an optimization procedure can be used to select values for $\theta$ that maximize the marginial likelihood.

The behavior of individual realizations from the GP is governed by the covariance function. This function controls both the degree of *shrinkage* to the mean function and the *smoothness* of functions sampled from the GP.

PyMC3 includes a library of covariance functions to choose from. A flexible choice to start with is the Matèrn covariance.

$$k_{M}(x) = \frac{\sigma^2}{\Gamma(\nu)2^{\nu-1}} \left(\frac{\sqrt{2 \nu} x}{l}\right)^{\nu} K_{\nu}\left(\frac{\sqrt{2 \nu} x}{l}\right)$$where where $\Gamma$ is the gamma function and $K$ is a modified Bessel function. The form of covariance matrices sampled from this function is governed by three parameters, each of which controls a property of the covariance.

**amplitude**($\sigma$) controls the scaling of the output along the y-axis. This parameter is just a scalar multiplier, and is therefore usually left out of implementations of the Matèrn function (*i.e.*set to one)**lengthscale**($l$) complements the amplitude by scaling realizations on the x-axis. Larger values make points appear closer together.**roughness**($\nu$) controls the sharpness of ridges in the covariance function, which ultimately affect the roughness (smoothness) of realizations.

Though in general all the parameters are non-negative real-valued, when $\nu = p + 1/2$ for integer-valued $p$, the function can be expressed partly as a polynomial function of order $p$ and generates realizations that are $p$-times differentiable, so values $\nu \in \{3/2, 5/2\}$ are extremely common.

To provide an idea regarding the variety of forms or covariance functions, here's small selection of available ones:

In [18]:

```
X = np.linspace(0,2,200)[:,None]
# function to display covariance matrices
def plot_cov(X, K, stationary=True):
K = K + 1e-8*np.eye(X.shape[0])
x = X.flatten()
with sns.axes_style("white"):
fig = plt.figure(figsize=(14,5))
ax1 = fig.add_subplot(121)
m = ax1.imshow(K, cmap="inferno",
interpolation='none',
extent=(np.min(X), np.max(X), np.max(X), np.min(X)));
plt.colorbar(m);
ax1.set_title("Covariance Matrix")
ax1.set_xlabel("X")
ax1.set_ylabel("X")
ax2 = fig.add_subplot(122)
if not stationary:
ax2.plot(x, np.diag(K), "k", lw=2, alpha=0.8)
ax2.set_title("The Diagonal of K")
ax2.set_ylabel("k(x,x)")
else:
ax2.plot(x, K[:,0], "k", lw=2, alpha=0.8)
ax2.set_title("K as a function of x - x'")
ax2.set_ylabel("k(x,x')")
ax2.set_xlabel("X")
fig = plt.figure(figsize=(14,4))
ax = fig.add_subplot(111)
samples = np.random.multivariate_normal(np.zeros(200), K, 5).T;
for i in range(samples.shape[1]):
ax.plot(x, samples[:,i], color=cmap.inferno(i*0.2), lw=2);
ax.set_title("Samples from GP Prior")
ax.set_xlabel("X")
```

This is the squared exponential covariance function.

In [19]:

```
with pm.Model() as model:
l = 0.2
tau = 2.0
b = 0.5
cov = b + tau * pm.gp.cov.ExpQuad(1, l)
K = theano.function([], cov(X))()
plot_cov(X, K)
```

In [20]:

```
with pm.Model() as model:
l = 0.2
tau = 2.0
cov = tau * pm.gp.cov.Matern32(1, l)
K = theano.function([], cov(X))()
plot_cov(X, K)
```

In [21]:

```
with pm.Model() as model:
l = 0.2
tau = 2.0
cov = tau * pm.gp.cov.Cosine(1, l)
K = theano.function([], cov(X))()
plot_cov(X, K)
```

Now that we have a general idea about covariance functions, let's begin by defining one for our first model.

We can use a Matern(5/2) covariance to model our simulated data, and pass this as the `cov_func`

argument to the `Marginal`

class.

In [22]:

```
with pm.Model() as model:
ℓ = pm.Gamma("ℓ", alpha=2, beta=1)
η = pm.HalfCauchy("η", beta=5)
cov = η**2 * pm.gp.cov.Matern52(1, ℓ)
mean = pm.gp.mean.Constant(c=1)
gp = pm.gp.Marginal(mean_func=mean, cov_func=cov)
```

`.marginal_likelihood`

method¶The unknown latent function can be analytically integrated out of the product of the GP prior probability with a normal likelihood. This quantity is called the marginal likelihood.

$$ p(y \mid x) = \int p(y \mid f, x) \, p(f \mid x) \, df $$The log of the marginal likelihood, $p(y \mid x)$, is

$$ \log p(y \mid x) = -\frac{1}{2} (\mathbf{y} - \mathbf{m}_x)^{T} (\mathbf{K}_{xx} + \boldsymbol\Sigma)^{-1} (\mathbf{y} - \mathbf{m}_x) - \frac{1}{2}|\mathbf{K}_{xx} + \boldsymbol\Sigma| - \frac{n}{2}\log (2 \pi) $$$\boldsymbol\Sigma$ is the covariance matrix of the Gaussian noise. Since the Gaussian noise doesn't need to be white to be conjugate, the `marginal_likelihood`

method supports either using a white noise term when a scalar is provided, or a noise covariance function when a covariance function is provided.

The `gp.marginal_likelihood`

method implements the quantity given above.

In [23]:

```
X = x.reshape(-1, 1)
with model:
σ = pm.HalfCauchy("σ", beta=5)
obs = gp.marginal_likelihood("obs", X=X, y=y, noise=σ)
```

In [24]:

```
with model:
marginal_post = pm.find_MAP()
```

We can collect the results into a pandas dataframe to display

In [25]:

```
pd.DataFrame({"Parameter": ["ℓ", "η", "σ"],
"Value at MAP": [float(marginal_post["ℓ"]), float(marginal_post["η"]), float(marginal_post["σ"])],
"True value": [ℓ_true, η_true, σ_true]})
```

Out[25]:

`.conditional`

distribution¶In addition to fitting the model, we would like to be able to generate predictions. This implies sampling from the posterior predictive distribution, which if you recall is just some linear algebra:

$$\begin{aligned} m^*(x^*) &= k(x^*,x)^T[k(x,x) + \sigma^2I]^{-1}y \\ k^*(x^*) &= k(x^*,x^*)+\sigma^2 - k(x^*,x)^T[k(x,x) + \sigma^2I]^{-1}k(x^*,x) \end{aligned}$$PyMC3 allows for predictive sampling after the model is fit, using the recorded values of the model parameters to generate samples. The `conditional`

method implements the predictive GP above, called with a grid of points over which to generate realizations:

The `.conditional`

has an optional flag for `pred_noise`

, which defaults to `False`

. When `pred_noise=False`

, the `conditional`

method produces the predictive distribution for the underlying function represented by the GP. When `pred_noise=True`

, the `conditional`

method produces the predictive distribution for the GP plus noise.

If using an additive GP model, the conditional distribution for individual components can be constructed by setting the optional argument `given`

.

We can define a grid of new values from `x=0`

to `x=20`

, then add the GP conditional to the model, given the new X values:

In [26]:

```
X_new = np.linspace(0, 20, 600)[:,None]
with model:
f_pred = gp.conditional("f_pred", X_new)
```

We can draw samples from the posterior predictive distribution over the specified grid of values.

In [27]:

```
with model:
pred_samples = pm.sample_posterior_predictive([marginal_post], vars=[f_pred], samples=500)
```

In [28]:

```
# plot the results
fig = plt.figure(figsize=(12,5)); ax = fig.gca()
# plot the samples from the gp posterior with samples and shading
from pymc3.gp.util import plot_gp_dist
plot_gp_dist(ax, pred_samples["f_pred"], X_new);
# plot the data and the true latent function
plt.plot(X, f_true, "dodgerblue", lw=3, label="True f");
plt.plot(X, y, 'ok', ms=3, alpha=0.5, label="Observed data");
# axis labels and title
plt.xlabel("X"); plt.ylim([-13,13]);
plt.title("Posterior distribution over $f(x)$ at the observed values"); plt.legend();
```

The prediction also matches the results from `gp.Latent`

very closely. What about predicting new data points? Here we only predicted $f_*$, not $f_*$ + noise, which is what we actually observe.

The `conditional`

method of `gp.Marginal`

contains the flag `pred_noise`

whose default value is `False`

. To draw from the *posterior predictive* distribution, we simply set this flag to `True`

.

In [29]:

```
with model:
y_pred = gp.conditional("y_pred", X_new, pred_noise=True)
y_samples = pm.sample_posterior_predictive([marginal_post], vars=[y_pred], samples=500)
```

In [30]:

```
fig = plt.figure(figsize=(12,5)); ax = fig.gca()
# posterior predictive distribution
plot_gp_dist(ax, y_samples["y_pred"], X_new, plot_samples=False, palette="bone_r");
# overlay a scatter of one draw of random points from the
# posterior predictive distribution
plt.plot(X_new, y_samples["y_pred"][200, :].T, "co", ms=2, label="Predicted data");
# plot original data and true function
plt.plot(X, y, 'ok', ms=3, alpha=1.0, label="observed data");
plt.plot(X, f_true, "dodgerblue", lw=3, label="true f");
plt.xlabel("x"); plt.ylim([-13,13]);
plt.title("posterior predictive distribution, y_*"); plt.legend();
```

That was contrived data; let's try applying Gaussian processes to a real problem. The plot below shows the relationship between the number of spawning salmon in a particular stream and the number of fry that are recruited into the population in the spring.

We would like to model this relationship, which appears to be non-linear (we have biological knowledge that suggests it should be non-linear too).

In [31]:

```
salmon_data = pd.read_table('../data/salmon.txt', sep='\s+', index_col=0)
salmon_data.plot.scatter(x='spawners', y='recruits', s=50);
```

In [32]:

```
with pm.Model() as salmon_model:
# Lengthscale
ρ = pm.HalfCauchy('ρ', 3)
η = pm.HalfCauchy('η', 3)
M = pm.gp.mean.Linear(coeffs=(salmon_data.recruits/salmon_data.spawners).mean())
K = (η**2) * pm.gp.cov.ExpQuad(1, ρ)
σ = pm.HalfCauchy('σ', 1)
recruit_gp = pm.gp.Marginal(mean_func=M, cov_func=K)
recruit_gp.marginal_likelihood('recruits', X=salmon_data.spawners.values.reshape(-1,1),
y=salmon_data.recruits.values, noise=σ)
```

In [34]:

```
with salmon_model:
salmon_trace = pm.sample(1000, tune=3000, chains=2)
```

In [35]:

```
pm.traceplot(salmon_trace, varnames=['ρ', 'η', 'σ']);
```