import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from prml.preprocessing import PolynomialFeature, GaussianFeature, SigmoidFeature
from prml.distribution import MultivariateGaussian
from prml.kernel import RBF, GaussianProcessRegression, GaussianProcessClassifier
from prml.datasets import generate_toy_data
# Set random seed to make deterministic
np.random.seed(0)
# Ignore zero divisions and computation involving NaN values.
np.seterr(divide="ignore", invalid="ignore")
# Enable higher resolution plots
%config InlineBackend.figure_format = 'retina'
# Enable autoreload all modules before executing code
%reload_ext autoreload
%autoreload 2
In Chapters 3 and 4, we considered linear parametric models governed by a vector w of adaptive parameters. During the learning phase, a set of training data is used to obtain a point estimate of the parameter vector or to determine its posterior distribution. The training set may then be discarded, and predictions for new inputs are based only on the learned parameters. The same approach is employed for non-linear models such as neural networks.
However, there is a class of techniques, in which the training data are kept and used also in the prediction phase. For instance, memory-based methods, such as Parzen density models and nearest-neighbors, store the entire training set in order to make predictions for future data points. These methods typically require a metric that measures the similarity of any pair of vectors in the input space. They are generally fast to train, because they just store the training data, and slow at making predictions, because they have to pass over the training set, possibly multiple times.
Interestingly, many linear parametric models can be re-cast into an equivalent dual representation in which the predictions are also based on linear combinations of a kernel function evaluated on the training data points. Assuming models based on a fixed nonlinear feature space mapping ϕ(x), the kernel function is defined by
$$k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$$

where the kernel is a symmetric function of its arguments, k(x,x′)=k(x′,x). The simplest example of a kernel function is obtained by considering the identity feature mapping, ϕ(x)=x, so that k(x,x′)=xTx′, which is referred to as the linear kernel.
The concept of a kernel formulated as an inner product in a feature space allows us to build interesting extensions of well-known algorithms by making use of the kernel trick, also known as kernel substitution. The idea is to replace the scalar product of the input vector x in the formulation of interest with any kernel.
One of the most significant developments has been the extension of kernels to handle symbolic objects.
Consider a linear regression model whose parameters are determined by minimizing the regularized sum-of-squares error function given by
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N \left( \mathbf{w}^T\phi(\mathbf{x}_n) - t_n \right)^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$$

If we set the gradient of E(w) with respect to w equal to zero, the solution for w takes the form of a linear combination of the vectors ϕ(xn), with coefficients an that are functions of w,
$$
\begin{aligned}
\nabla E(\mathbf{w}) = 0
&\Leftrightarrow \nabla\left[ \frac{1}{2}\sum_{n=1}^N \left( \mathbf{w}^T\phi(\mathbf{x}_n) - t_n \right)^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w} \right] = 0 \\
&\Leftrightarrow \sum_{n=1}^N \left( \mathbf{w}^T\phi(\mathbf{x}_n) - t_n \right)\phi(\mathbf{x}_n) + \lambda\mathbf{w} = 0 \\
&\Leftrightarrow \lambda\mathbf{w} = -\sum_{n=1}^N \left( \mathbf{w}^T\phi(\mathbf{x}_n) - t_n \right)\phi(\mathbf{x}_n) \\
&\Leftrightarrow \mathbf{w} = -\frac{1}{\lambda}\sum_{n=1}^N \left( \mathbf{w}^T\phi(\mathbf{x}_n) - t_n \right)\phi(\mathbf{x}_n) \\
&\Leftrightarrow \mathbf{w} = \sum_{n=1}^N a_n\phi(\mathbf{x}_n)
\end{aligned}
$$

where $a_n = -\frac{1}{\lambda}\left( \mathbf{w}^T\phi(\mathbf{x}_n) - t_n \right)$. Then, using the (N×M) design matrix Φ, whose nth row is given by ϕ(xn)T, and a=(a1,…,aN)T, we obtain
$$\mathbf{w} = \boldsymbol{\Phi}^T\mathbf{a}$$

The dual representation is obtained if, instead of working with the parameter vector w, we reformulate the least-squares algorithm in terms of the parameter vector a. If we substitute w=ΦTa into E(w), we obtain,
$$
\begin{aligned}
E(\mathbf{a}) &= \frac{1}{2}\sum_{n=1}^N \left( \mathbf{a}^T\boldsymbol{\Phi}\phi(\mathbf{x}_n) - t_n \right)^2 + \frac{\lambda}{2}\mathbf{a}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\mathbf{a} \\
&= \frac{1}{2}||\boldsymbol{\Phi}\boldsymbol{\Phi}^T\mathbf{a} - \mathbf{t}||_2^2 + \frac{\lambda}{2}\mathbf{a}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\mathbf{a} \\
&= \frac{1}{2}\left( \mathbf{a}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\mathbf{a} - 2\mathbf{a}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\mathbf{t} + \mathbf{t}^T\mathbf{t} \right) + \frac{\lambda}{2}\mathbf{a}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\mathbf{a} \\
&= \frac{1}{2}\mathbf{a}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\mathbf{a} - \mathbf{a}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\mathbf{t} + \frac{1}{2}\mathbf{t}^T\mathbf{t} + \frac{\lambda}{2}\mathbf{a}^T\boldsymbol{\Phi}\boldsymbol{\Phi}^T\mathbf{a}
\end{aligned}
$$

By defining the Gram matrix K=ΦΦT, which is a symmetric N×N matrix with elements Knm=k(xn,xm)=ϕ(xn)Tϕ(xm), where k is a kernel function, the sum-of-squares error function can be written as,
$$E(\mathbf{a}) = \frac{1}{2}\mathbf{a}^T\mathbf{K}\mathbf{K}\mathbf{a} - \mathbf{a}^T\mathbf{K}\mathbf{t} + \frac{1}{2}\mathbf{t}^T\mathbf{t} + \frac{\lambda}{2}\mathbf{a}^T\mathbf{K}\mathbf{a}$$

If we set the gradient of E(a) equal to zero, we obtain the following solution,
$$
\begin{aligned}
\nabla E(\mathbf{a}) = 0
&\Leftrightarrow \nabla\left[ \frac{1}{2}\mathbf{a}^T\mathbf{K}\mathbf{K}\mathbf{a} - \mathbf{a}^T\mathbf{K}\mathbf{t} + \frac{1}{2}\mathbf{t}^T\mathbf{t} + \frac{\lambda}{2}\mathbf{a}^T\mathbf{K}\mathbf{a} \right] = 0 \\
&\Leftrightarrow \mathbf{K}\mathbf{K}\mathbf{a} - \mathbf{K}\mathbf{t} + \lambda\mathbf{K}\mathbf{a} = 0 \\
&\Leftrightarrow \mathbf{K}(\mathbf{K} + \lambda\mathbf{I}_N)\mathbf{a} = \mathbf{K}\mathbf{t} \\
&\Leftrightarrow (\mathbf{K} + \lambda\mathbf{I}_N)\mathbf{a} = \mathbf{t} \\
&\Leftrightarrow \mathbf{a} = (\mathbf{K} + \lambda\mathbf{I}_N)^{-1}\mathbf{t}
\end{aligned}
$$

where the second-to-last step multiplies both sides by K−1 (K is assumed invertible). By substituting this back into the linear regression model, we obtain,
$$y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x}) = \mathbf{a}^T\boldsymbol{\Phi}\phi(\mathbf{x}) = \mathbf{k}(\mathbf{x})^T\mathbf{a} = \mathbf{k}(\mathbf{x})^T(\mathbf{K} + \lambda\mathbf{I}_N)^{-1}\mathbf{t}$$

where $\mathbf{k}(\mathbf{x}) = \boldsymbol{\Phi}\phi(\mathbf{x})$, that is, $\mathbf{k}(\mathbf{x})^T = [k(\mathbf{x}_1,\mathbf{x}), \dots, k(\mathbf{x}_N,\mathbf{x})] = [\phi(\mathbf{x}_1)^T\phi(\mathbf{x}), \dots, \phi(\mathbf{x}_N)^T\phi(\mathbf{x})]$.
Therefore, the dual formulation allows the solution to the least-squares problem to be expressed entirely in terms of a kernel function.
NOTE: In the dual formulation, we determine the parameter vector a by inverting an N×N matrix, whereas in the original parameter space formulation we had to invert an M×M matrix in order to determine w. Because N is typically much larger than M, the dual formulation does not seem to be particularly useful. However, the advantage of the dual formulation is that it is expressed entirely in terms of the kernel function k(x,x′). Therefore, we can work directly in terms of kernels and avoid the explicit introduction of the feature vector ϕ(x), which allows us to use feature spaces of high, even infinite, dimensionality.
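As a quick sanity check (a minimal NumPy sketch, independent of the prml package, using a hypothetical cubic polynomial feature map), we can verify that the dual solution a=(K+λI)−1t yields exactly the same predictions as the primal regularized least-squares solution w=(λI+ΦTΦ)−1ΦTt from Chapter 3:

import numpy as np

rng = np.random.default_rng(0)

# Toy data and an explicit polynomial feature map phi(x) = (1, x, x^2, x^3).
x_train = rng.uniform(-1, 1, size=20)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=20)
x_new = rng.uniform(-1, 1, size=5)

def phi(x):
    return np.vander(x, N=4, increasing=True)  # (len(x), 4) design matrix

lam = 0.1
Phi = phi(x_train)                                         # N x M design matrix

# Primal solution: w = (lambda I_M + Phi^T Phi)^{-1} Phi^T t
w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t_train)
y_primal = phi(x_new) @ w

# Dual solution: a = (K + lambda I_N)^{-1} t, with Gram matrix K = Phi Phi^T
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(len(x_train)), t_train)
y_dual = phi(x_new) @ Phi.T @ a                            # k(x)^T a, since k(x)_n = phi(x_n)^T phi(x)

print(np.allclose(y_primal, y_dual))                       # True: the two formulations agree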
In fact, the existence of a dual representation based on the Gram matrix is a property of many linear models. For instance, let's develop the same representation for the Perceptron algorithm. The update rule for the Perceptron is as follows,
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta\,\phi(\mathbf{x}_n)t_n$$

where the update is applied whenever the pattern xn is misclassified. Assuming w(0)=0, then
$$\mathbf{w}^{(1)} = \eta\,\phi(\mathbf{x}_{n_1})t_{n_1}, \qquad \mathbf{w}^{(2)} = \eta\,\phi(\mathbf{x}_{n_1})t_{n_1} + \eta\,\phi(\mathbf{x}_{n_2})t_{n_2}, \qquad \dots$$

where $\mathbf{x}_{n_\tau}$ denotes the pattern misclassified at step τ. In other words, the weight vector is always a sum of terms of the form ηtnϕ(xn), so collecting the contributions of each data point, with cn counting how many times xn has been used in an update,
$$\mathbf{w} = \sum_{n=1}^N \eta\,c_n t_n\phi(\mathbf{x}_n) = \sum_{n=1}^N a_n t_n\phi(\mathbf{x}_n)$$

where an=ηcn. By substituting w back into the original update formula, we obtain,
$$\sum_{m=1}^N a_m^{(\tau+1)} t_m\phi(\mathbf{x}_m) = \sum_{m=1}^N a_m^{(\tau)} t_m\phi(\mathbf{x}_m) + \eta\,t_n\phi(\mathbf{x}_n) \quad\Leftrightarrow\quad a_n^{(\tau+1)} = a_n^{(\tau)} + \eta$$

In other words, each update simply adds the learning rate η to the coefficient an corresponding to the misclassified input xn, leaving the remaining coefficients unchanged. Finally, substituting w back into the linear model, we obtain,
$$y(\mathbf{x}) = f\!\left( \mathbf{w}^T\phi(\mathbf{x}) \right) = f\!\left( \sum_{n=1}^N a_n t_n\phi(\mathbf{x}_n)^T\phi(\mathbf{x}) \right) = f\!\left( \sum_{n=1}^N a_n t_n k(\mathbf{x}_n, \mathbf{x}) \right)$$

A minimal sketch of this kernelized Perceptron is given below. More generally, one approach to obtaining a kernel is to choose a feature space mapping ϕ(x), e.g., polynomial, Gaussian, or sigmoidal basis functions, and then use it to find the corresponding kernel, as illustrated in the cell after the sketch.
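A rough sketch (plain NumPy on hypothetical toy data, with an assumed quadratic kernel) of the kernel Perceptron, which never forms ϕ(x) explicitly and only uses kernel evaluations:

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data with targets in {-1, +1} that are not linearly separable in input space.
X = rng.normal(size=(40, 2))
t = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

def kernel(a, b):
    # Any valid kernel works here; this is the inhomogeneous quadratic kernel k(x, z) = (x^T z + 1)^2.
    return (a @ b.T + 1) ** 2

K = kernel(X, X)
eta = 1.0
alpha = np.zeros(len(X))                       # one coefficient a_n per training point

for _ in range(100):                           # a fixed number of passes, for simplicity
    for n in range(len(X)):
        # y(x_n) = f(sum_m a_m t_m k(x_m, x_n)), with f = sign
        if np.sign(np.sum(alpha * t * K[:, n])) != t[n]:
            alpha[n] += eta                    # misclassified: add eta to its coefficient

y_train = np.sign(K @ (alpha * t))
print("training accuracy:", np.mean(y_train == t))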
x_space = np.linspace(-1, 1, 100)
# Create 12 degree polynomial basis functions
polynomial = PolynomialFeature(degree=12)
# Create 12 Gaussian basis functions
gaussian = GaussianFeature(mean=np.linspace(-1, 1, 12), sigma=0.1)
# Create 12 sigmoid basis functions
sigmoid = SigmoidFeature(mean=np.linspace(-1, 1, 12), sigma=0.1)
plt.figure(figsize=(15, 5))
for i, phi in enumerate([polynomial, gaussian, sigmoid]):
    x = phi.transform(x_space)
    # Upper row: the basis functions themselves
    plt.subplot(2, 3, i + 1)
    for j in range(x.shape[1]):
        plt.plot(x_space, x[:, j])
    plt.title(phi.__class__.__name__.removesuffix("Feature").lower())
    # Lower row: the kernel k(x, x') as a function of x, with x' marked by a red cross
    plt.subplot(2, 3, i + 4)
    x_prime = -0.5 if i == 0 else 0.0
    plt.plot(x_prime, 0, "rx")
    plt.plot(x_space, x @ phi.transform(x_prime).T)
plt.show()
Another approach is to construct kernel functions directly. In order to exploit kernel substitution, we need to be able to construct valid kernel functions, or in other words kernels that correspond to a scalar product in some (perhaps infinite dimensional) feature space. Consider for instance, a kernel function given by k(x,z)=(xTz)2. Expanding the terms, we can thereby identify the corresponding nonlinear feature mapping,
$$
\begin{aligned}
k(\mathbf{x}, \mathbf{z}) = \left( \mathbf{x}^T\mathbf{z} \right)^2 &= (x_1z_1 + x_2z_2 + \dots + x_Nz_N)^2 \\
&= \sum_{n=1}^N x_n^2z_n^2 + 2\sum_{n=1}^N\sum_{m=1}^{n-1} x_nz_nx_mz_m \\
&= (x_1^2, \sqrt{2}x_1x_2, \dots, x_N^2)(z_1^2, \sqrt{2}z_1z_2, \dots, z_N^2)^T \\
&= \phi(\mathbf{x})^T\phi(\mathbf{z})
\end{aligned}
$$

One standard technique for constructing kernels is to build them out of simpler kernels as building blocks, using the properties (6.13) to (6.22) appearing in the book. In general, the kernel should be symmetric and positive semidefinite and express the appropriate form of similarity between x and x′.
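A quick numerical check of this identity in two dimensions, where ϕ(x)=(x1², √2 x1x2, x2²):

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=2)                          # a random 2-dimensional input
z = rng.normal(size=2)

def phi(v):
    # Explicit feature map of the quadratic kernel for 2-dimensional inputs.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

print((x @ z) ** 2, phi(x) @ phi(z))            # the two numbers coincide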
A commonly used kernel is the Gaussian kernel, which takes the form,
$$k(\mathbf{x}, \mathbf{z}) = \exp\left( -\frac{||\mathbf{x} - \mathbf{z}||^2}{2\sigma^2} \right)$$

In this context, it is not interpreted as a probability density, and hence the normalization coefficient is omitted.
By expanding the square we obtain,
$$||\mathbf{x} - \mathbf{z}||^2 = \mathbf{x}^T\mathbf{x} - 2\mathbf{x}^T\mathbf{z} + \mathbf{z}^T\mathbf{z}$$

thus,
$$\exp\left( -\frac{\mathbf{x}^T\mathbf{x} - 2\mathbf{x}^T\mathbf{z} + \mathbf{z}^T\mathbf{z}}{2\sigma^2} \right) = \exp\left( -\frac{\mathbf{x}^T\mathbf{x}}{2\sigma^2} \right)\exp\left( \frac{\mathbf{x}^T\mathbf{z}}{\sigma^2} \right)\exp\left( -\frac{\mathbf{z}^T\mathbf{z}}{2\sigma^2} \right)$$

which is a valid kernel due to (6.14) and (6.16) and the fact that the linear kernel is valid. Note that the Gaussian kernel is not restricted to the Euclidean distance. If we use kernel substitution to replace xTz with a nonlinear kernel κ(x,z), we obtain
$$k(\mathbf{x}, \mathbf{z}) = \exp\left( -\frac{1}{2\sigma^2}\left( \kappa(\mathbf{x},\mathbf{x}) - 2\kappa(\mathbf{x},\mathbf{z}) + \kappa(\mathbf{z},\mathbf{z}) \right) \right)$$

Kernel functions may also be defined over generic objects, as diverse as graphs, sets, strings, and text documents. Consider, for instance, a fixed set and define a non-vectorial space consisting of all possible subsets of this set. If A1 and A2 are two such subsets, then one simple choice of kernel is

$$k(A_1, A_2) = 2^{|A_1 \cap A_2|}$$

where |A| denotes the number of elements in A, so that the kernel counts the number of subsets shared by A1 and A2.
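This kernel translates directly into code (a tiny sketch using Python sets):

def subset_kernel(A1, A2):
    # k(A1, A2) = 2^{|A1 intersect A2|}
    return 2 ** len(A1 & A2)

A1, A2 = {"a", "b", "c"}, {"b", "c", "d"}
print(subset_kernel(A1, A2))  # |A1 ∩ A2| = 2, so the kernel value is 2^2 = 4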
Radial basis functions have the property that each basis function depends only on the radial distance (typically Euclidean) from a centre μj, so that ϕj(x)=h(||x−μj||). Historically, radial basis functions were introduced for the purpose of exact function interpolation, which is achieved by expressing f(x) as a linear combination of radial basis functions, one centred on every data point. The parameters w are then found by least squares, and because there are the same number of parameters as there are constraints (data points), the resulting function fits every target value exactly. In machine learning applications, however, the target values are generally noisy, and exact interpolation is undesirable (over-fitting). Moreover, because there is one basis function associated with every data point, the corresponding model can be computationally costly to evaluate when making predictions for new data points.
One way of choosing the basis function centres is to randomly choose a subset of the data points. A more systematic approach is orthogonal least squares, a sequential selection process in which, at each step, the next data point to be chosen as a basis function centre is the one that gives the greatest reduction in the sum-of-squares error. Clustering algorithms such as k-means have also been used, giving a set of basis function centres that no longer coincide with training data points. The simplest of these strategies is sketched below.
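A minimal sketch (on hypothetical 1-D data) of choosing the centres as a random subset of the data points and building the corresponding Gaussian RBF design matrix:

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=50)                      # hypothetical 1-D training inputs
n_centres, s = 8, 0.1                               # number of centres and RBF width

# Choose the basis-function centres as a random subset of the data points.
centres = rng.choice(x, size=n_centres, replace=False)

# Gaussian RBF design matrix: Phi[n, j] = h(||x_n - mu_j||) = exp(-(x_n - mu_j)^2 / (2 s^2))
Phi = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * s ** 2))
print(Phi.shape)                                    # (50, 8): one column per centre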
In order to motivate the Gaussian process viewpoint, let us return to the linear regression example and re-derive the predictive distribution by working in terms of distributions over functions y(x,w). Consider the model defined in terms of a linear combination of M fixed basis functions given by the elements of the vector ϕ(x) so that,
$$y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})$$

Then, consider a prior distribution over w given by an isotropic Gaussian (see also Chapter 3) of the form,
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{0}, \alpha^{-1}\mathbf{I})$$

Since, for any given value of w, y(x) defines a particular function of x, the probability distribution over w induces a probability distribution over functions y(x). For the training data points x1,…,xN, we are therefore interested in the joint distribution of the function values y(x1),…,y(xN), or y=Φw, where Φ is the design matrix.
In order to find the probability distribution of y, note that y is a linear combination of Gaussian distributed variables given by the elements of w and hence is itself Gaussian. We therefore need only to find its mean and covariance,
$$\mathbb{E}[\mathbf{y}] = \mathbb{E}[\boldsymbol{\Phi}\mathbf{w}] = \boldsymbol{\Phi}\,\mathbb{E}[\mathbf{w}] = \boldsymbol{\Phi}\mathbf{0} = \mathbf{0}$$

and
$$\text{cov}[\mathbf{y}] = \mathbb{E}[\mathbf{y}\mathbf{y}^T] = \mathbb{E}[\boldsymbol{\Phi}\mathbf{w}\mathbf{w}^T\boldsymbol{\Phi}^T] = \boldsymbol{\Phi}\,\mathbb{E}[\mathbf{w}\mathbf{w}^T]\,\boldsymbol{\Phi}^T = \frac{1}{\alpha}\boldsymbol{\Phi}\boldsymbol{\Phi}^T = \mathbf{K}$$

where K is the Gram matrix.
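This can be verified by simulation (a rough sketch: sample many weight vectors from the prior, map them through an arbitrary design matrix Φ, and compare the sample covariance of y=Φw with α−1ΦΦT):

import numpy as np

rng = np.random.default_rng(4)
alpha, N, M, S = 2.0, 5, 3, 200_000                    # prior precision, data points, basis functions, samples

Phi = rng.normal(size=(N, M))                          # an arbitrary design matrix for the check
w = rng.normal(scale=1 / np.sqrt(alpha), size=(S, M))  # rows are samples w ~ N(0, alpha^{-1} I)
y = w @ Phi.T                                          # each row is one sampled function-value vector y = Phi w

K = Phi @ Phi.T / alpha
print(np.max(np.abs(np.cov(y, rowvar=False) - K)))     # small: the sample covariance approximates K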
This model provides us with a particular example of a Gaussian process, defined by the linear regression model (6.49) with a weight prior (6.50). In cases where the input vector x is two dimensional, this is also known as a Gaussian random field.
Thus, a Gaussian stochastic process is specified completely by the second-order statistics of the joint distribution over the N function values, namely the mean and the covariance. When we have no prior knowledge about the mean of y(x), we typically take it to be zero. This is equivalent to choosing the mean of the prior over weight values p(w) to be zero in the basis function viewpoint. The specification of the Gaussian process is then completed by giving the covariance of y(x) evaluated at any two values xn and xm, which is given by the kernel function
$$\mathbb{E}[y(\mathbf{x}_n)y(\mathbf{x}_m)] = K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)$$

When defining the kernel function directly for the covariance, two common choices are the Gaussian kernel, given earlier, and the exponential kernel given by,
$$k(\mathbf{x}, \mathbf{z}) = \exp\left( -\theta|\mathbf{x} - \mathbf{z}| \right)$$

Recall that the noise on the observed target values for linear regression models may be modelled as,
$$t_n = y_n + \epsilon_n$$

where yn=y(xn) and ϵn is a random noise variable whose value is chosen independently for each observation. To that end, the noise process may be modelled as a Gaussian distribution,
$$p(t_n|y_n) = \mathcal{N}(t_n|y_n, \beta^{-1})$$

Because the noise is independent for each data point, the joint distribution of the target values is given by an isotropic Gaussian of the form,
$$p(\mathbf{t}|\mathbf{y}) = \mathcal{N}(\mathbf{t}|\mathbf{y}, \beta^{-1}\mathbf{I}_N)$$

From the definition of a Gaussian process, the marginal distribution p(y) is given by a Gaussian whose mean is zero and whose covariance is defined by a Gram matrix, so that,
$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y}|\mathbf{0}, \mathbf{K})$$

The kernel function that determines K is typically chosen to express that, for points xn and xm that are similar, the corresponding values y(xn) and y(xm) will be more strongly correlated than for dissimilar points. A widely used kernel function for Gaussian process regression is given by the exponential of a quadratic form, with the addition of constant and linear terms,
$$k(\mathbf{x}_n, \mathbf{x}_m) = \theta_0\exp\left\{ -\frac{\theta_1}{2}||\mathbf{x}_n - \mathbf{x}_m||^2 \right\} + \theta_2 + \theta_3\mathbf{x}_n^T\mathbf{x}_m$$

x_space = np.linspace(-1, 1, 100)[:, None]
def K(theta, x):
    # Gram matrix for the exponential-of-a-quadratic kernel with constant and linear terms.
    N = x.shape[0]
    r = np.zeros((N, N))
    for (i, _), xn in np.ndenumerate(x):
        for (j, _), xm in np.ndenumerate(x):
            r[i][j] = theta[0] * np.exp(-(theta[1] / 2) * np.linalg.norm(xn - xm) ** 2) + theta[2] + theta[3] * xn * xm
    return r
plt.figure(figsize=(15, 4), tight_layout=True)
mu = np.zeros(x_space.shape)
for i, theta in enumerate([(1, 4, 0, 0), (9, 4, 0, 0), (1, 64, 0, 0), (1, 0.25, 0, 0), (1, 4, 10, 0), (1, 4, 0, 5)]):
prior_gaussian = MultivariateGaussian(mu, K(theta, x_space))
plt.subplot(2, 3, i + 1)
plt.plot(x_space, prior_gaussian.draw(5).T)
plt.title(f"$\\theta={theta}$")
plt.show()
In order to find the marginal distribution p(t), conditioned on the input values x1,…,xN, we need to integrate over y,
$$p(\mathbf{t}) = \int p(\mathbf{t}|\mathbf{y})p(\mathbf{y})\,d\mathbf{y} = \int \mathcal{N}(\mathbf{t}|\mathbf{y}, \beta^{-1}\mathbf{I}_N)\,\mathcal{N}(\mathbf{y}|\mathbf{0}, \mathbf{K})\,d\mathbf{y} \overset{(2.115)}{=} \mathcal{N}(\mathbf{t}|\mathbf{0}, \mathbf{C})$$

where $\mathbf{C} = \beta^{-1}\mathbf{I}_N + \mathbf{K}$, since A=I and b=0 in (2.115). This result reflects the convolution of two independent Gaussian sources of randomness (associated with y and ϵ), and thus their covariances simply sum.
We have used the Gaussian process viewpoint to build a model of the joint distribution over sets of data points. Our goal in regression, however, is to predict the target variables for new inputs, given a set of training data. This requires that we evaluate the predictive distribution p(tN+1|xN+1,tN,XN). To derive the conditional distribution, we start from the joint distribution p(tN+1), thus, from (6.61) we obtain,
$$p(\mathbf{t}_{N+1}) = \mathcal{N}(\mathbf{t}_{N+1}|\mathbf{0}, \mathbf{C}_{N+1})$$

where, applying the results from Section 2.3.1, we have defined the covariance matrix as follows,
$$\mathbf{C}_{N+1} = \begin{bmatrix} \mathbf{C}_N & \mathbf{k} \\ \mathbf{k}^T & c \end{bmatrix}$$

where the vector k has elements k(xn,xN+1), for n=1,…,N, and c=k(xN+1,xN+1)+β−1.
By analogy to Eq. (2.94) - (2.98), we can simply treat tN+1 as xa, tN as xb, c as Σaa, k as Σba, kT as Σab and CN as Σbb. Substituting them into Eq. (2.79) and Eq. (2.80) gives,
$$\Lambda_{aa} = \left( \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} \right)^{-1} = \left( c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} \right)^{-1}$$

and
$$\Lambda_{ab} = -\left( \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} \right)^{-1}\Sigma_{ab}\Sigma_{bb}^{-1} = -\left( c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} \right)^{-1}\mathbf{k}^T\mathbf{C}_N^{-1}$$

For the mean μa|b, we have,
$$
\begin{aligned}
\mu_{a|b} &= 0 - \left( c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} \right)\left( -\left( c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} \right)^{-1}\mathbf{k}^T\mathbf{C}_N^{-1} \right)(\mathbf{t}_N - \mathbf{0}) \\
&= \left( c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} \right)\left( c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} \right)^{-1}\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{t}_N \\
&= \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{t}_N
\end{aligned}
$$

Therefore, the predictive distribution is as follows,
$$p(t_{N+1}|\mathbf{t}_N) = \mathcal{N}\left( t_{N+1}\,\middle|\,\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{t}_N,\; c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} \right)$$

Note that for large training data sets, the direct application of Gaussian process methods may become infeasible, since the required inversion of the N×N matrix CN scales as O(N³).
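Before using the GaussianProcessRegression class from the prml package below, here is a from-scratch sketch of these two formulas, assuming a plain exponential-of-a-quadratic kernel with hand-picked parameters:

import numpy as np

rng = np.random.default_rng(5)

def kernel(a, b, theta=(1.0, 15.0)):
    # k(x, z) = theta_0 * exp(-theta_1 / 2 * (x - z)^2)
    return theta[0] * np.exp(-0.5 * theta[1] * (a[:, None] - b[None, :]) ** 2)

beta = 100.0                                                   # noise precision
x_train = rng.uniform(0, 0.7, size=10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=10)
x_new = np.linspace(0, 1, 100)

C_N = kernel(x_train, x_train) + np.eye(len(x_train)) / beta   # C_N = K + beta^{-1} I_N
k = kernel(x_train, x_new)                                     # k[n, i] = k(x_n, x_new_i)
c = kernel(x_new, x_new).diagonal() + 1 / beta                 # c_i = k(x_new_i, x_new_i) + beta^{-1}

mean = k.T @ np.linalg.solve(C_N, t_train)                     # k^T C_N^{-1} t_N
var = c - np.sum(k * np.linalg.solve(C_N, k), axis=0)          # c - k^T C_N^{-1} k
print(mean.shape, var.shape)                                   # (100,) (100,): one prediction per test input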
x_space = np.linspace(0, 1, 100)
x, t = generate_toy_data(lambda x: np.sin(2 * np.pi * x), sample_size=10, std=0.5, domain=(0, 0.7))
gpr = GaussianProcessRegression(kernel=RBF(theta=np.ones(x.ndim) * 5), beta=100)
gpr.fit(x, t)
mu, sigma = gpr.predict(x_space)
plt.scatter(x, t)
plt.plot(x_space, np.sin(2 * np.pi * x_space), color="green", label="$\sin(2\pi x)$")
plt.plot(x_space, mu, color="red", label="GPR")
plt.fill_between(x_space, mu - 2 * np.sqrt(sigma), mu + 2 * np.sqrt(sigma), alpha=0.5, color="pink", label="std")
plt.legend()
plt.show()
The extension of the Gaussian process formalism to multiple target variables, collected in T=(t1,…,tN), is known as co-kriging; the predictive distribution becomes,
$$p(\mathbf{t}_{N+1}|\mathbf{T}) = \mathcal{N}\left( \mathbf{t}_{N+1}\,\middle|\,\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{T},\; c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} \right)$$

which corresponds to a multivariate Gaussian distribution.
In practice, rather than fixing the covariance function, we may use a parametric family of functions and infer the parameters from the data. These parameters may govern the length scale of the correlations and the precision of the noise. Techniques for learning these hyperparameters are based on the evaluation of the likelihood function p(t|θ) where θ denotes the hyperparameters of the Gaussian process model. The simplest approach is to make a point estimate of θ by maximizing the log likelihood function. Maximization of the log-likelihood can be done using efficient gradient-based optimization algorithms, such as conjugate gradients.
The log-likelihood function for a Gaussian process regression model is easily evaluated using the standard form for a multivariate Gaussian distribution, giving
$$
\begin{aligned}
\ln p(\mathbf{t}|\boldsymbol{\theta}) = \ln\mathcal{N}(\mathbf{t}|\mathbf{0}, \mathbf{C})
&= \ln\left( \frac{1}{(2\pi)^{N/2}|\mathbf{C}|^{1/2}}\exp\left\{ -\frac{1}{2}\mathbf{t}^T\mathbf{C}^{-1}\mathbf{t} \right\} \right) \\
&= -\ln\left( (2\pi)^{N/2}|\mathbf{C}|^{1/2} \right) - \frac{1}{2}\mathbf{t}^T\mathbf{C}^{-1}\mathbf{t} \\
&= -\frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln|\mathbf{C}| - \frac{1}{2}\mathbf{t}^T\mathbf{C}^{-1}\mathbf{t}
\end{aligned}
$$

Then, we need the gradient of the log-likelihood function with respect to the parameter vector,
$$
\begin{aligned}
\frac{\partial}{\partial\theta_i}\ln p(\mathbf{t}|\boldsymbol{\theta})
&= -\frac{1}{2}\frac{\partial}{\partial\theta_i}\ln|\mathbf{C}| - \frac{1}{2}\frac{\partial}{\partial\theta_i}\mathbf{t}^T\mathbf{C}^{-1}\mathbf{t} \\
&\overset{(C.21)}{=} -\frac{1}{2}\frac{\partial}{\partial\theta_i}\ln|\mathbf{C}| + \frac{1}{2}\mathbf{t}^T\mathbf{C}^{-1}\frac{\partial\mathbf{C}}{\partial\theta_i}\mathbf{C}^{-1}\mathbf{t} \\
&\overset{(C.22)}{=} -\frac{1}{2}\text{Tr}\left( \mathbf{C}^{-1}\frac{\partial\mathbf{C}}{\partial\theta_i} \right) + \frac{1}{2}\mathbf{t}^T\mathbf{C}^{-1}\frac{\partial\mathbf{C}}{\partial\theta_i}\mathbf{C}^{-1}\mathbf{t}
\end{aligned}
$$

where the evaluation of the partial derivatives of C depends on the choice of covariance function (kernel).
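A sketch of how these two expressions translate into code, assuming (for concreteness) a single-parameter Gaussian kernel k(x,z)=exp(−θ(x−z)²/2), for which ∂C/∂θ=−½(x−z)²k(x,z) elementwise; the analytic gradient is checked against a finite difference:

import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=15)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=15)
beta = 100.0

def log_likelihood_and_grad(theta):
    sq = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * theta * sq)                     # Gram matrix of the kernel
    C = K + np.eye(len(x)) / beta                     # C = K + beta^{-1} I_N
    C_inv = np.linalg.inv(C)
    C_inv_t = C_inv @ t
    ll = -0.5 * (len(x) * np.log(2 * np.pi) + np.linalg.slogdet(C)[1] + t @ C_inv_t)
    dC = -0.5 * sq * K                                # elementwise dC/dtheta for this kernel
    grad = -0.5 * np.trace(C_inv @ dC) + 0.5 * C_inv_t @ dC @ C_inv_t
    return ll, grad

# Check the analytic gradient against a central finite difference at theta = 5
theta, eps = 5.0, 1e-5
_, grad = log_likelihood_and_grad(theta)
numeric = (log_likelihood_and_grad(theta + eps)[0] - log_likelihood_and_grad(theta - eps)[0]) / (2 * eps)
print(grad, numeric)                                  # the two values agree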
Note that lnp(t|θ) is a nonconvex function and it can have multiple maxima.
Moreover, we have assumed that the contribution to the predictive variance arising from the additive noise, governed by β, is constant. For some problems, known as heteroscedastic, the noise variance itself depends on x. One solution is to introduce a second Gaussian process to represent the dependence of β on the input x.
Maximizing the likelihood for learning the length-scale parameter can usefully be extended by incorporating a separate parameter for each input variable. This allows the relative importance of different inputs to be inferred from the data, which represents an example of automatic relevance determination (ARD). Therefore, we may have a kernel function of the form,
$$k(\mathbf{x}, \mathbf{z}) = \exp\left\{ -\frac{1}{2}\sum_i \eta_i\left( x_i - z_i \right)^2 \right\}$$

As a particular parameter ηi becomes small, the function becomes relatively insensitive to the corresponding input variable xi. By adapting these parameters to a data set using maximum likelihood, we can detect input variables that have little effect on the predictive distribution, because the corresponding values of ηi will be small. This is useful in practice because it allows irrelevant inputs to be discarded.
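A small illustration of this insensitivity, using two hypothetical inputs that differ only in their second coordinate:

import numpy as np

def ard_kernel(x, z, eta):
    # k(x, z) = exp(-1/2 * sum_i eta_i * (x_i - z_i)^2)
    return np.exp(-0.5 * np.sum(eta * (x - z) ** 2))

x = np.array([0.3, 0.5])
z = np.array([0.3, 0.9])                                  # differs from x only in the second coordinate

print(ard_kernel(x, z, eta=np.array([1.0, 10.0])))        # approx 0.45: the difference is clearly "seen"
print(ard_kernel(x, z, eta=np.array([1.0, 0.01])))        # approx 1.00: the second input is effectively ignored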
The partial derivative of k(x,z) with respect to ηi is given by,
$$
\begin{aligned}
\frac{\partial}{\partial\eta_i}\exp\left\{ -\frac{1}{2}\sum_j \eta_j\left( x_j - z_j \right)^2 \right\}
&= \exp\left\{ -\frac{1}{2}\sum_j \eta_j\left( x_j - z_j \right)^2 \right\}\left( -\frac{1}{2}\frac{\partial}{\partial\eta_i}\sum_j \eta_j\left( x_j - z_j \right)^2 \right) \\
&= -\frac{1}{2}\left( x_i - z_i \right)^2\exp\left\{ -\frac{1}{2}\sum_j \eta_j\left( x_j - z_j \right)^2 \right\}
\end{aligned}
$$

x0 = np.linspace(0, 1, 20)
x1 = x0 + np.random.normal(scale=0.1, size=20)
x2 = np.random.normal(scale=0.1, size=20)
t = np.sin(2 * np.pi * x0) + np.random.normal(scale=0.1, size=20)
x = np.vstack((x0, x1, x2)).T
x0 = np.linspace(0, 1, 100)
x1 = x0 + np.random.normal(scale=0.1, size=100)
x2 = np.random.normal(scale=0.1, size=100)
x_space = np.vstack((x0, x1, x2)).T
model = GaussianProcessRegression(kernel=RBF(np.array([1.0, 1.0, 1.0])), beta=100)
model.fit(x, t)
mu, sigma = model.predict(x_space)
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.scatter(x[:, 0], t)
plt.plot(x_space[:, 0], np.sin(2 * np.pi * x_space[:, 0]), color="green", label="$\sin(2\pi x)$")
plt.plot(x_space[:, 0], mu, color="red", label="GPR")
plt.fill_between(x_space[:, 0], mu - 2 * np.sqrt(sigma), mu + 2 * np.sqrt(sigma), alpha=0.5, color="pink", label="std")
plt.legend()
model = GaussianProcessRegression(kernel=RBF(np.array([1.0, 1.0, 1.0])), beta=100)
model.fit(x, t, 1000, 0.001)
mu_ard, sigma_ard = model.predict(x_space)
plt.subplot(1, 2, 2)
plt.scatter(x[:, 0], t)
plt.plot(x_space[:, 0], np.sin(2 * np.pi * x_space[:, 0]), color="green", label="$\sin(2\pi x)$")
plt.plot(
    x_space[:, 0], mu_ard, color="red", label=f"GPR (ARD) {[np.round(p, 2) for p in model._kernel.theta.tolist()]}"
)
plt.fill_between(
    x_space[:, 0],
    mu_ard - 2 * np.sqrt(sigma_ard),
    mu_ard + 2 * np.sqrt(sigma_ard),
    alpha=0.5,
    color="pink",
    label="std",
)
plt.legend()
plt.show()
-- Iterations 0: 106.62105480846333 -- Iterations 100: 16.73499899402815 -- Iterations 200: 10.900800926148609 -- Iterations 300: 8.352834973210353 -- Iterations 400: 6.79206345414369 -- Iterations 500: 5.658719022565627 -- Iterations 600: 4.743421926165226 -- Iterations 700: 3.953116694126315 -- Iterations 800: 3.259548576802345 -- Iterations 900: 2.6734093844879148
Similar to logistic regression, we can adapt Gaussian processes to classification problems by transforming the output using an appropriate nonlinear activation function. Consider a binary problem with a target variable t∈{0,1}. If we define a Gaussian process over a function a(x) and then transform the function using a logistic sigmoid y=σ(a), we obtain a non-Gaussian stochastic process over functions y(x)∈(0,1).
As an example, we plot a sample from a Gaussian process prior over functions a(x) and the resulting transformation using the logistic sigmoid function.
x_space = np.linspace(-1, 1, 100)
def logistic(x):
    return 1 / (1 + np.exp(-x))
k = RBF(theta=np.ones(x_space.ndim) * 100)
a = np.random.multivariate_normal(np.zeros(x_space.shape), k(x_space, x_space), 1)[0]
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(x_space, a)
bound = max(np.abs(a.min()), a.max())
plt.ylim(-bound, bound)
plt.title("$\\alpha$")
plt.subplot(1, 2, 2)
plt.plot(x_space, logistic(a), color="red")
plt.ylim(0, 1)
plt.title("$\sigma(\\alpha)$")
plt.show()
Our goal is to determine the predictive distribution p(tN+1|t). To that end, we introduce a Gaussian process prior over the vector aN+1=[a(x1),…,a(xN+1)]T, which in turn, defines a non-Gaussian process over tN+1, due to the nonlinear transformation. Then, by conditioning on the training data tN we obtain the required predictive distribution. The Gaussian process prior for aN+1 takes the form,
$$p(\mathbf{a}_{N+1}) = \mathcal{N}(\mathbf{a}_{N+1}|\mathbf{0}, \mathbf{C}_{N+1})$$

Unlike the regression case, the covariance matrix no longer includes a noise term because we assume that all training data points are correctly labelled. However, for numerical reasons it is convenient to introduce a noise-like term governed by a parameter β that ensures that the covariance matrix is positive definite. Thus, the covariance matrix has elements given by,
$$C_{nm} = k(\mathbf{x}_n, \mathbf{x}_m) + \beta\delta_{nm}$$

For a binary classification problem, it is sufficient to predict p(tN+1=1|tN), because p(tN+1=0|tN)=1−p(tN+1=1|tN). Thus, the required predictive distribution is given by,
$$p(t_{N+1}=1|\mathbf{t}_N) = \int p(t_{N+1}=1|a_{N+1})\,p(a_{N+1}|\mathbf{t}_N)\,da_{N+1} \overset{(6.73)}{=} \int \sigma(a_{N+1})\,p(a_{N+1}|\mathbf{t}_N)\,da_{N+1}$$

This integral is analytically intractable, and so it must be approximated. As in previous chapters, we rely on the Laplace approximation. In particular, we seek a Gaussian approximation to the posterior distribution over aN+1, which, using Bayes' theorem, gives
$$
\begin{aligned}
p(a_{N+1}|\mathbf{t}_N) &= \int p(a_{N+1}, \mathbf{a}_N|\mathbf{t}_N)\,d\mathbf{a}_N \\
&= \frac{1}{p(\mathbf{t}_N)}\int p(\mathbf{t}_N|a_{N+1}, \mathbf{a}_N)\,p(a_{N+1}, \mathbf{a}_N)\,d\mathbf{a}_N \\
&= \frac{1}{p(\mathbf{t}_N)}\int p(\mathbf{t}_N|\mathbf{a}_N)\,p(a_{N+1}|\mathbf{a}_N)\,p(\mathbf{a}_N)\,d\mathbf{a}_N \\
&= \int p(a_{N+1}|\mathbf{a}_N)\,p(\mathbf{a}_N|\mathbf{t}_N)\,d\mathbf{a}_N
\end{aligned}
$$

where the second line uses Bayes' theorem, the third uses the fact that tN depends only on aN together with the product rule p(aN+1,aN)=p(aN+1|aN)p(aN), and the last line applies Bayes' theorem again in the form p(aN|tN)=p(tN|aN)p(aN)/p(tN). The conditional distribution p(aN+1|aN) is obtained similarly to p(tN+1|tN) in Gaussian process regression, invoking (6.66) and (6.67),
$$p(a_{N+1}|\mathbf{a}_N) = \mathcal{N}\left( a_{N+1}\,\middle|\,\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{a}_N,\; c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} \right)$$

Since p(tN+1=1|aN+1)=σ(aN+1), we may generate classification predictions using $\sigma(\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{a}_N)$.
# number of training points
N = 100
x, t = make_classification(
n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, n_samples=N, random_state=21
)
x1, x2 = np.meshgrid(np.linspace(-5, 5, N), np.linspace(-5, 5, N))
x_test = np.array([x1, x2]).reshape(2, -1).T
model = GaussianProcessClassifier(kernel=RBF(theta=np.array([0.5, 0.5])))
model.fit(x, t)
predicted, _ = model.predict(x_test)
plt.scatter(x[:, 0], x[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(N, N), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.show()
However, in order to derive the full predictive distribution we also need to find a Laplace approximation for the posterior distribution p(aN|tN) in order to evaluate the integrals (6.77) and (6.76). The posterior is proportional to the product of the likelihood and a prior,
$$p(\mathbf{a}_N|\mathbf{t}_N) \propto p(\mathbf{t}_N|\mathbf{a}_N)\,p(\mathbf{a}_N)$$

where the prior p(aN) is given by a zero-mean Gaussian process with covariance matrix CN, and the likelihood is given by,
$$p(\mathbf{t}_N|\mathbf{a}_N) = \prod_{n=1}^N \sigma(a_n)^{t_n}\left( 1 - \sigma(a_n) \right)^{1-t_n}$$

Then, in order to obtain the Laplace approximation, we take the logarithm of the posterior,
$$
\begin{aligned}
\ln p(\mathbf{a}_N|\mathbf{t}_N) &= \ln p(\mathbf{t}_N|\mathbf{a}_N) + \ln p(\mathbf{a}_N) + \text{const} \\
&= -\frac{1}{2}\mathbf{a}_N^T\mathbf{C}_N^{-1}\mathbf{a}_N - \frac{N}{2}\ln(2\pi) - \frac{1}{2}\ln|\mathbf{C}_N| + \mathbf{t}_N^T\mathbf{a}_N - \sum_{n=1}^N\ln\left( 1 + e^{a_n} \right) + \text{const}
\end{aligned}
$$

where the constant, −ln p(tN), does not depend on aN. (A minimal sketch of maximizing this log posterior numerically, and of forming the resulting predictions, is given at the end of this section.)

In a Bayesian neural network, the prior distribution over the parameter vector w, in conjunction with the network function f(x,w), produces a prior distribution over functions y(x). It has been shown that, for a broad class of prior distributions over w, the distribution of functions generated by the neural network tends to a Gaussian process in the limit of an infinite number M of hidden units. Note, however, that in this limit the output variables of the neural network become independent. By contrast, a typical neural network's outputs share the hidden units, so the hidden unit weights are influenced by all of the output variables. This property is therefore lost in the Gaussian process limit.
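Returning to the Laplace approximation for the Gaussian process classifier: the mode $\mathbf{a}_N^\star$ of the log posterior above can be found iteratively, and predictions then follow from $\sigma(\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{a}_N^\star)$. The sketch below (a rough illustration on hypothetical data, not the GaussianProcessClassifier implementation from the prml package) uses the Newton-Raphson update that follows from the gradient $\mathbf{t}_N - \sigma(\mathbf{a}_N) - \mathbf{C}_N^{-1}\mathbf{a}_N$ and Hessian $-\mathbf{W}_N - \mathbf{C}_N^{-1}$ of the log posterior, where $\mathbf{W}_N = \text{diag}(\sigma_n(1-\sigma_n))$:

import numpy as np

rng = np.random.default_rng(7)

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def kernel(a, b, theta=2.0):
    # Gaussian kernel between two sets of points (rows are points).
    return np.exp(-0.5 * theta * np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1))

# Hypothetical two-class data in two dimensions with targets in {0, 1}.
X = rng.normal(size=(50, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)

beta = 1e-4                                          # small noise-like term for positive definiteness
C = kernel(X, X) + beta * np.eye(len(X))             # C_nm = k(x_n, x_m) + beta * delta_nm

# Newton-Raphson iterations for the mode a* of ln p(a_N | t_N):
# a <- C (I + W C)^{-1} (t - sigma(a) + W a), obtained from the gradient and Hessian above
a = np.zeros(len(X))
for _ in range(20):
    sigma = sigmoid(a)
    W = np.diag(sigma * (1 - sigma))
    a = C @ np.linalg.solve(np.eye(len(X)) + W @ C, t - sigma + W @ a)

# Predictions for two new points via sigma(k^T C_N^{-1} a*)
X_new = np.array([[1.0, 1.0], [-1.0, -1.0]])
k = kernel(X, X_new)                                 # k[n, i] = k(x_n, x_new_i)
print(sigmoid(k.T @ np.linalg.solve(C, a)))          # larger for the first point than for the second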