import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from prml.kernel import (
SupportVectorClassifier,
SupportVectorRegressor,
RBF,
RelevanceVectorRegressor,
RelevanceVectorClassifier,
)
from prml.datasets import generate_toy_data
# Set random seed to make deterministic
np.random.seed(0)
# Ignore zero divisions and computation involving NaN values.
np.seterr(divide="ignore", invalid="ignore")
# Enable higher resolution plots
%config InlineBackend.figure_format = 'retina'
# Enable autoreload all modules before executing code
%reload_ext autoreload
%autoreload 2
One significant limitation of many kernel-based methods (see Chapter 6) is that the kernel function $k(\mathbf{x}_n, \mathbf{x}_m)$ must be evaluated for all possible pairs $\mathbf{x}_n$ and $\mathbf{x}_m$ of training data points. This can be computationally infeasible during training (it does not scale to large datasets) and also leads to excessive computational overhead when making predictions for new data points.
On the other hand, there are kernel-based algorithms that yield sparse solutions (they retain only a subset of the training data points), so that predictions depend only on the kernel function evaluated at this subset. We shall look into the support vector machine (SVM), which is easy to train using convex optimization but does not provide posterior probabilities. An alternative sparse kernel technique, known as the relevance vector machine (RVM), is based on a Bayesian formulation and provides probabilistic (posterior) outputs. Additionally, the RVM typically yields much sparser solutions than the SVM, but it is slower to train.
Consider the classification problem using linear models of the form,
$$y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x}) + b$$
The training data set comprises $N$ input vectors $\mathbf{x}_1,\dots,\mathbf{x}_N$ and corresponding target values $t_1,\dots,t_N$, where $t_n \in \{-1, 1\}$.
We assume for the moment that the training dataset is linearly separable in the feature space defined by $\phi$, so that there exists at least one choice of the parameters such that $y(\mathbf{x}_n) > 0$ for points having $t_n = +1$ and $y(\mathbf{x}_n) < 0$ for points having $t_n = -1$; in other words, $t_n y(\mathbf{x}_n) > 0$ for all training data points.
The support vector machine approaches this problem through the concept of the margin, which is defined to be the smallest distance between the decision boundary and any of the samples. In the support vector machine the decision boundary is chosen to be the one for which the margin is maximized. Recall that the perpendicular distance of a point $\mathbf{x}_n$ from the decision boundary, defined by $y(\mathbf{x}) = 0$, is given by $|y(\mathbf{x}_n)| / \|\mathbf{w}\|_2$. Since we are only interested in solutions for which all data points are correctly classified, we have $t_n y(\mathbf{x}_n) > 0$ for all $n$. Thus, the distance of a point $\mathbf{x}_n$ to the decision surface is given by,
$$d_s(\mathbf{x}_n) = \frac{t_n y(\mathbf{x}_n)}{\|\mathbf{w}\|_2} = \frac{t_n (\mathbf{w}^T\phi(\mathbf{x}_n) + b)}{\|\mathbf{w}\|_2}$$
Thus, the margin is given by the distance of the closest point $\mathbf{x}_n$ in the training set, and the SVM goal is to optimize the parameters of $y$ so as to maximize this distance. Therefore, the maximum margin solution is found by solving,
$$\underset{\mathbf{w},b}{\operatorname{arg\,max}} \left\{ \min_n d_s(\mathbf{x}_n) \right\}
= \underset{\mathbf{w},b}{\operatorname{arg\,max}} \left\{ \min_n \left[\frac{t_n(\mathbf{w}^T\phi(\mathbf{x}_n)+b)}{\|\mathbf{w}\|_2}\right] \right\}
= \underset{\mathbf{w},b}{\operatorname{arg\,max}} \left\{ \frac{1}{\|\mathbf{w}\|_2}\min_n \left[t_n(\mathbf{w}^T\phi(\mathbf{x}_n)+b)\right] \right\} \tag{7.2}$$
The figure on the left depicts the margin as the distance between the decision boundary and the closest data point. On the right, the margin is maximized, leading to a particular choice of the decision boundary (determined by the parameters $\mathbf{w}$ and $b$). The subset of data points determining the location of the optimized boundary are called support vectors.
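To make the margin concrete, the following short sketch (plain NumPy, with a hypothetical linear feature map $\phi(\mathbf{x}) = \mathbf{x}$ and an arbitrary candidate choice of $\mathbf{w}$ and $b$) evaluates the distances $d_s(\mathbf{x}_n)$ and the resulting margin; the particular values of $\mathbf{w}$ and $b$ here are purely illustrative.
import numpy as np
# Toy linearly separable data with labels in {-1, +1}.
x_toy = np.array([[0.0, 2.0], [2.0, 0.0], [-1.0, -1.0]])
t_toy = np.array([1.0, 1.0, -1.0])
# An arbitrary candidate decision boundary (phi(x) = x, i.e. the identity feature map).
w, b = np.array([1.0, 1.0]), -0.5
# Signed distances t_n * y(x_n) / ||w||_2; all positive iff every point is correctly classified.
distances = t_toy * (x_toy @ w + b) / np.linalg.norm(w)
print("distances:", distances.round(3), " margin:", distances.min().round(3))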
A direct solution of the above optimization problem would be very complex, but there is an equivalent problem that is much easier to solve. Note that the rescaling $\mathbf{w} \mapsto \kappa\mathbf{w}$ and $b \mapsto \kappa b$ does not affect the distance from any point $\mathbf{x}_n$ to the decision surface,
$$\frac{t_n(\kappa\mathbf{w}^T\phi(\mathbf{x}_n) + \kappa b)}{\|\kappa\mathbf{w}\|_2} = \frac{t_n \kappa(\mathbf{w}^T\phi(\mathbf{x}_n) + b)}{\kappa\|\mathbf{w}\|_2} = \frac{t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b)}{\|\mathbf{w}\|_2}$$
Therefore, we can set $t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) = 1$ for the point closest to the surface, which implies that all data points satisfy the constraints $t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) \geq 1$. For data points where the equality holds, the constraints are said to be active, whereas for the remainder they are said to be inactive.
Thus, the reformulated optimization problem simply requires that we maximize $1/\|\mathbf{w}\|_2$, or equivalently minimize $\|\mathbf{w}\|_2^2$,
$$\underset{\mathbf{w},b}{\operatorname{arg\,min}}\ \frac{1}{2}\|\mathbf{w}\|_2^2 \quad \text{subject to} \quad t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) \geq 1$$
This is a quadratic programming problem in which we are trying to minimize a quadratic function subject to a set of linear inequality constraints. In order to solve this constrained optimization problem, we introduce Lagrange multipliers $a_n \geq 0$ (one for each constraint), giving the Lagrangian function,
$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|_2^2 - \sum_{n=1}^N a_n \left\{ t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) - 1 \right\}$$
Setting the derivatives of $L$ with respect to $\mathbf{w}$ and $b$ equal to zero, we obtain,
$$\frac{\partial L}{\partial \mathbf{w}} = 0 \Leftrightarrow \mathbf{w} - \sum_{n=1}^N a_n t_n \phi(\mathbf{x}_n) = 0 \Leftrightarrow \mathbf{w} = \sum_{n=1}^N a_n t_n \phi(\mathbf{x}_n)$$
and
$$\frac{\partial L}{\partial b} = 0 \Leftrightarrow \sum_{n=1}^N a_n t_n = 0$$
Eliminating $\mathbf{w}$ and $b$ by substituting these conditions back into $L(\mathbf{w}, b, \mathbf{a})$ gives the dual representation of the maximum margin problem,
$$\begin{aligned}
\tilde{L}(\mathbf{a}) &= \frac{1}{2}\left\langle \sum_{n=1}^N a_n t_n \phi(\mathbf{x}_n), \sum_{n=1}^N a_n t_n \phi(\mathbf{x}_n) \right\rangle - \sum_{n=1}^N a_n \left\{ t_n \left( \sum_{m=1}^N a_m t_m \phi(\mathbf{x}_m)^T\phi(\mathbf{x}_n) + b \right) - 1 \right\} \\
&= \frac{1}{2}\left\langle \sum_{n=1}^N a_n t_n \phi(\mathbf{x}_n), \sum_{n=1}^N a_n t_n \phi(\mathbf{x}_n) \right\rangle - \sum_{n=1}^N \sum_{m=1}^N a_n t_n a_m t_m \phi(\mathbf{x}_m)^T\phi(\mathbf{x}_n) - b\sum_{n=1}^N a_n t_n + \sum_{n=1}^N a_n \\
&\overset{(7.9)}{=} \frac{1}{2}\left\langle \sum_{n=1}^N a_n t_n \phi(\mathbf{x}_n), \sum_{n=1}^N a_n t_n \phi(\mathbf{x}_n) \right\rangle - \sum_{n=1}^N \sum_{m=1}^N a_n t_n a_m t_m \phi(\mathbf{x}_m)^T\phi(\mathbf{x}_n) + \sum_{n=1}^N a_n \\
&= \frac{1}{2}\sum_{n=1}^N a_n t_n \sum_{m=1}^N a_m t_m \phi(\mathbf{x}_m)^T\phi(\mathbf{x}_n) - \sum_{n=1}^N \sum_{m=1}^N a_n t_n a_m t_m \phi(\mathbf{x}_m)^T\phi(\mathbf{x}_n) + \sum_{n=1}^N a_n \\
&= \sum_{n=1}^N a_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N a_n t_n a_m t_m \phi(\mathbf{x}_m)^T\phi(\mathbf{x}_n)
\end{aligned}$$
or
$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^N a_n - \frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N a_n t_n a_m t_m k(\mathbf{x}_n, \mathbf{x}_m)$$
subject to the constraints,
$$a_n \geq 0, \qquad \sum_{n=1}^N a_n t_n = 0$$
In order to classify new data points, we evaluate the sign of $y(\mathbf{x})$, which can be expressed in terms of the parameters $\mathbf{a}$ and the kernel function by substituting for $\mathbf{w}$ to give,
$$y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x}) + b = \sum_{n=1}^N a_n t_n \phi(\mathbf{x}_n)^T\phi(\mathbf{x}) + b = \sum_{n=1}^N a_n t_n k(\mathbf{x}_n, \mathbf{x}) + b$$
NOTE: Any data point for which $a_n = 0$ plays no role in making predictions. The remaining data points (where $a_n > 0$) are called support vectors, and because they satisfy $t_n y(\mathbf{x}_n) = 1$, they correspond to points that lie on the maximum margin hyperplanes in feature space. This property is central to the practical applicability of sparse kernel machines. Once the model is trained, a significant proportion of the data points can be discarded, since only the support vectors are required for making predictions.
Having found values for $\mathbf{a}$, we can then determine the value of the threshold parameter $b$ by noting that support vectors satisfy $t_n y(\mathbf{x}_n) = 1$, thus,
$$\begin{aligned}
t_n y(\mathbf{x}_n) = 1 &\Leftrightarrow t_n \left( \sum_{m \in S} a_m t_m k(\mathbf{x}_m, \mathbf{x}_n) + b \right) = 1 \qquad (\times\, t_n) \\
&\Leftrightarrow t_n^2 \left( \sum_{m \in S} a_m t_m k(\mathbf{x}_m, \mathbf{x}_n) + b \right) = t_n \qquad (t_n^2 = 1) \\
&\Leftrightarrow \sum_{m \in S} a_m t_m k(\mathbf{x}_m, \mathbf{x}_n) + b = t_n \\
&\Leftrightarrow b = t_n - \sum_{m \in S} a_m t_m k(\mathbf{x}_m, \mathbf{x}_n)
\end{aligned}$$
Although the value of $b$ may be found using any support vector $\mathbf{x}_n$, a more numerically stable solution is obtained by averaging over all support vectors,
$$b = \frac{1}{N_S}\sum_{n \in S}\left( t_n - \sum_{m \in S} a_m t_m k(\mathbf{x}_m, \mathbf{x}_n) \right)$$
x = np.array([[0.0, 2.0], [2.0, 0.0], [-1.0, -1.0]])
t = np.array([1.0, 1.0, -1.0])
plt.scatter(x[:, 0], x[:, 1], s=40, c=[("g" if label == 1 else "r") for label in t], marker="x")
plt.show()
model = SupportVectorClassifier(kernel=None, C=np.inf)
model.fit(x, t)
print(f"Found {model.n_support_vectors} support vectors.")
Found 3 support vectors.
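As a cross-check of the dual formulation derived above, the sketch below solves the hard-margin dual for the same three points (reusing x and t from the cell above) with a general-purpose constrained optimizer, scipy.optimize.minimize, and a linear kernel. It is only an illustrative sketch, not the solver assumed to be used inside SupportVectorClassifier.
from scipy.optimize import minimize
K = x @ x.T                                            # Gram matrix of the linear kernel k(x_n, x_m) = x_n^T x_m
Q = (t[:, None] * t[None, :]) * K
def neg_dual(a):                                       # negated dual objective -L~(a)
    return 0.5 * a @ Q @ a - a.sum()
res = minimize(
    neg_dual,
    np.zeros(len(t)),
    bounds=[(0.0, None)] * len(t),                     # a_n >= 0
    constraints={"type": "eq", "fun": lambda a: a @ t},  # sum_n a_n t_n = 0
)
a = res.x
support = a > 1e-6
w = (a * t) @ x                                        # w = sum_n a_n t_n phi(x_n), with phi(x) = x
b = np.mean(t[support] - (a * t) @ K[:, support])      # averaged over the support vectors
print("a =", a.round(3), " w =", w.round(3), " b =", b.round(3))
For this toy problem the maximum-margin solution is $\mathbf{w} = (0.5, 0.5)$ and $b = 0$, with all three constraints active, so all three points are support vectors, matching the count reported above.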
x0, x1 = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
xx = np.array([x0, x1]).reshape(2, -1).T
plt.scatter(x[:, 0], x[:, 1], s=40, c=[("g" if label == 1 else "r") for label in t], marker="x")
plt.scatter(model.support_vectors[:, 0], model.support_vectors[:, 1], s=100, facecolor="none", edgecolor="g")
cp = plt.contour(
x0,
x1,
model.predict(xx)[1].reshape(100, 100),
np.array([-1, 0, 1]),
colors="k",
linestyles=("dashed", "solid", "dashed"),
)
plt.clabel(cp, fmt="y=%.f", inline=True, fontsize=12)
plt.show()
So far, we have assumed that the training data points are linearly separable in the feature space. Thus, the support vector machine gives exact separation of the training data in the original input space, even though the corresponding decision boundary may be non-linear. In practice, however, the class-conditional distributions may overlap, in which case exact separation of the training data can lead to poor generalization.
We therefore need to modify the support vector machine so as to allow some of the training points to be misclassified. In the case of separable classes, we implicitly used an error function that gave infinite error if a data point was misclassified and zero error if it was classified correctly, and then optimized the model parameters to maximize the margin. We may modify this approach so that data points are allowed to be on the wrong side of the margin boundary, but with a penalty that increases with their distance from that boundary. To that end, we introduce slack variables $\xi_n \geq 0$ (one for each training data point), where $\xi_n = 0$ for data points that are on or inside the correct margin boundary and $\xi_n = |t_n - y(\mathbf{x}_n)|$ otherwise. Therefore, a data point that lies on the decision boundary $y(\mathbf{x}_n) = 0$ has $\xi_n = 1$, and points with $\xi_n > 1$ are misclassified. The exact classification constraints are then replaced by,
$$t_n y(\mathbf{x}_n) \geq 1 - \xi_n$$
This is known as relaxing the hard-margin constraint to give a soft margin. Note that while slack variables allow for overlapping class distributions, this framework is still sensitive to outliers because the penalty for misclassification increases linearly with $\xi_n$.
Our goal is thus to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary,
$$\underset{\mathbf{w},b}{\operatorname{arg\,min}}\ \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{n=1}^N \xi_n \quad \text{subject to} \quad t_n(\mathbf{w}^T\phi(\mathbf{x}_n)+b) \geq 1 - \xi_n, \quad \xi_n \geq 0$$
where the parameter $C > 0$ controls the trade-off between the slack variable penalty and the margin. Because any point that is misclassified has $\xi_n > 1$, it follows that $\sum_{n=1}^N \xi_n$ is an upper bound on the number of misclassified points. The parameter $C$ is analogous to (the inverse of) a regularization coefficient because it controls the trade-off between minimizing training errors and controlling model complexity. In the limit $C \to \infty$, we recover the earlier support vector machine for separable data.
In order to solve the optimization, the corresponding Lagrangian is given by
$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{n=1}^N \xi_n - \sum_{n=1}^N a_n\left\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \right\} - \sum_{n=1}^N \mu_n \xi_n$$
and the derivatives with respect to $\mathbf{w}$ and $b$ are identical to the hard-margin case, while the derivative with respect to $\xi_n$ is as follows,
$$\begin{aligned}
\frac{\partial L}{\partial \xi_n} = 0 &\Leftrightarrow \frac{\partial}{\partial \xi_n}\left( \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{n=1}^N \xi_n - \sum_{n=1}^N a_n\left\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \right\} - \sum_{n=1}^N \mu_n \xi_n \right) = 0 \\
&\Leftrightarrow C - \frac{\partial}{\partial \xi_n}\left( \sum_{n=1}^N a_n\left\{ t_n y(\mathbf{x}_n) - 1 + \xi_n \right\} \right) - \mu_n = 0 \\
&\Leftrightarrow C - a_n - \mu_n = 0 \\
&\Leftrightarrow a_n = C - \mu_n
\end{aligned}$$
Using these results to eliminate $\mathbf{w}$, $b$, and $\xi_n$ from the Lagrangian, we obtain a dual representation that is identical to the separable case, except for the additional constraint $a_n \leq C$.
Regarding the solution of the quadratic programming problem that arises in both cases (separable and non-separable), a global optimum can be found since the constraints define a convex region (as a consequence of being linear). A direct solution of a quadratic programming problem may be demanding, in terms of computation and memory, using traditional techniques. One popular approach to training support vector machines is sequential minimal optimization (SMO). SMO considers two Lagrange multipliers at a time, in which case the subproblem can be solved analytically, thereby avoiding numerical quadratic programming altogether. Moreover, heuristics exist for choosing the pair of Lagrange multipliers to consider at each step. In practice, SMO scales with the number of data points somewhere between linearly and quadratically.
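To give a flavor of how SMO works, here is a heavily simplified SMO-style sketch for the soft-margin dual with a linear kernel, choosing the second multiplier at random and using none of the usual pair-selection heuristics. It is a toy illustration only, not the solver assumed to be used inside SupportVectorClassifier.
def simplified_smo(X, t, C=1.0, tol=1e-3, max_passes=10, seed=0):
    # Toy SMO-style solver for the soft-margin dual with a linear kernel k(x_n, x_m) = x_n^T x_m.
    rng = np.random.default_rng(seed)
    N = len(t)
    K = X @ X.T
    a, b, passes = np.zeros(N), 0.0, 0
    while passes < max_passes:
        changed = 0
        for i in range(N):
            E_i = (a * t) @ K[:, i] + b - t[i]                      # y(x_i) - t_i
            if (t[i] * E_i < -tol and a[i] < C) or (t[i] * E_i > tol and a[i] > 0):
                j = int(rng.integers(N - 1))
                j += j >= i                                         # pick a random j != i
                E_j = (a * t) @ K[:, j] + b - t[j]
                a_i_old, a_j_old = a[i], a[j]
                # Box bounds on a_j implied by 0 <= a <= C and sum_n a_n t_n = 0.
                if t[i] != t[j]:
                    L, H = max(0.0, a[j] - a[i]), min(C, C + a[j] - a[i])
                else:
                    L, H = max(0.0, a[i] + a[j] - C), min(C, a[i] + a[j])
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                a[j] = np.clip(a[j] - t[j] * (E_i - E_j) / eta, L, H)   # analytic update of a_j
                if abs(a[j] - a_j_old) < 1e-5:
                    continue
                a[i] += t[i] * t[j] * (a_j_old - a[j])                  # keep sum_n a_n t_n = 0
                # Re-estimate the threshold b from the two updated multipliers.
                b1 = b - E_i - t[i] * (a[i] - a_i_old) * K[i, i] - t[j] * (a[j] - a_j_old) * K[i, j]
                b2 = b - E_j - t[i] * (a[i] - a_i_old) * K[i, j] - t[j] * (a[j] - a_j_old) * K[j, j]
                b = b1 if 0 < a[i] < C else (b2 if 0 < a[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return a, b
With a finite $C$, each update clips the multipliers into the box $[0, C]$, which is exactly the additional constraint derived above; for example, a, b = simplified_smo(x, t, C=10.0) could be run on the toy points from the earlier example.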
# Class -1 for points in the third quadrant (both coordinates negative), class +1 otherwise.
x = np.random.uniform(-1, 1, 100).reshape(-1, 2)
t = x < 0
t = (t[:, 0] * t[:, 1]).astype(float)
t = 1 - 2 * t
plt.scatter(x[:, 0], x[:, 1], s=40, c=[("g" if label == 1 else "r") for label in t], marker="x")
plt.show()
model = SupportVectorClassifier(RBF(theta=np.ones(2)))
model.fit(x, t)
print(f"Found {model.n_support_vectors} support vectors.")
Found 19 support vectors.
x0, x1 = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
xx = np.array([x0, x1]).reshape(2, -1).T
plt.scatter(x[:, 0], x[:, 1], s=40, c=[("g" if label == 1 else "r") for label in t], marker="x")
plt.scatter(model.support_vectors[:, 0], model.support_vectors[:, 1], s=100, facecolor="none", edgecolor="g")
cp = plt.contour(x0, x1, model.predict(xx)[1].reshape(100, 100), colors="k", linestyles=("dashed", "solid", "dashed"))
plt.clabel(cp, fmt="y=%.f", inline=True, fontsize=12)
plt.show()
Let us take a look at how the hyperparameters $\theta$ of an RBF kernel, and the trade-off parameter $C$ between the slack variable penalty and the margin, affect the resulting SVM model.
theta_grid = [1.0, 5.0, 10.0]
C_grid = [0.5, 1.0, 10.0]
plt.figure(figsize=(15, 10))
for i, c in enumerate(C_grid):
for j, theta in enumerate(theta_grid):
model = SupportVectorClassifier(RBF(theta=np.ones(2) * theta), C=c)
model.fit(x, t)
plt.subplot(3, 3, (j + 3 * i) + 1)
plt.scatter(x[:, 0], x[:, 1], s=40, c=[("g" if label == 1 else "r") for label in t], marker="x")
plt.scatter(model.support_vectors[:, 0], model.support_vectors[:, 1], s=100, facecolor="none", edgecolor="g")
plt.contourf(x0, x1, model.predict(xx)[0].reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.title(f"$C={c}$, $\\theta={theta}$, support vectors: {model.n_support_vectors}")
plt.axis("off")
plt.show()
Note that increasing θ makes the decision boundary smoother, and that increasing C results in less misclassification.
In order to study the relation of the SVM to logistic regression, we can cast the SVM for non-separable distributions in terms of the minimization of a regularized error function. Data points that are on the correct side of the margin boundary satisfy $y_n t_n \geq 1$ and have $\xi_n = 0$, while for the remaining points we have $\xi_n = 1 - y_n t_n$. Thus, the objective function can be written in the form,
$$\sum_{n=1}^N E_{SV}(y_n t_n) + \lambda \|\mathbf{w}\|_2^2$$
where $\lambda = (2C)^{-1}$, and $E_{SV}(\cdot)$ is the hinge error function defined by,
$$E_{SV}(y_n t_n) = [1 - y_n t_n]_+$$
where $[\cdot]_+$ denotes the positive part. In the logistic regression model it was convenient to work with target variables $t \in \{0, 1\}$. Thus, for comparison with the support vector machine, we reformulate maximum likelihood logistic regression using the target variable $t \in \{-1, 1\}$. To that end, note that $p(t = 1 | y) = \sigma(y)$. Therefore, $p(t = -1 | y) = 1 - \sigma(y) = \sigma(-y)$, where we have used the symmetry property of the logistic sigmoid function. Combining the two cases, we obtain $p(t | y) = \sigma(y t)$.
Given this formulation, the negative logarithm of the likelihood function, combined with a quadratic regularizer, takes the form,
$$\sum_{n=1}^N E_{LR}(y_n t_n) + \lambda \|\mathbf{w}\|_2^2$$
where $E_{LR}(y t) = \ln(1 + \exp(-y t))$. For comparison with other error functions, we divide by $\ln(2)$ so that the error function passes through the point $(0, 1)$. Below we plot the reformulated error function for logistic regression along with the hinge loss, the true misclassification error, and the squared error.
z = np.linspace(-2, 2, 100)
plt.plot(z, np.log(1 + np.exp(-z)) / np.log(2), label="$E_{LR}$ error", color="red")
plt.plot(z, np.where(1 - z > 0, 1 - z, 0), label="$E_{SV}$ error", color="blue")
plt.plot(z, np.where(z > 0, 0, 1), label="Ideal error", color="black")
plt.plot(z, (1 - z) ** 2, label="Squared error", color="green")
plt.ylim(0, 4)
plt.legend()
plt.show()
Note that the logistic regression error ELR has a similar form to the support vector error function. The main difference is that the flat region in ESV leads to sparse solutions. Both the logistic error and the hinge loss can be viewed as continuous approximations to the misclassification error. The squared error, on the other hand, places increasing emphasis on data points that are correctly classified. These points are strongly weighted at the expense of misclassified points, and so if the objective is to minimize the misclassification rate, then a monotonically decreasing error function is a better choice.
The support vector machine is fundamentally a two-class classifier. To tackle problems involving $K > 2$ classes, we usually combine multiple two-class SVMs in order to build a multiclass classifier. One commonly used approach is to train $K$ separate SVMs, where the $k$-th model $y_k(\mathbf{x})$ is trained using the data from class $\mathcal{C}_k$ as the positive examples and the data from the remaining $K - 1$ classes as the negative examples. This is known as the one-versus-the-rest approach. In order to avoid inconsistent results, that is, assigning input examples to multiple classes simultaneously, predictions are made as follows,
$$y(\mathbf{x}) = \max_k y_k(\mathbf{x})$$
This heuristic has some issues, however: the individual classifiers are trained on different tasks, so there is no guarantee that their real-valued outputs $y_k(\mathbf{x})$ have comparable scales, and the training sets are imbalanced, since each classifier sees far more negative than positive examples.
Another approach is to train $K(K-1)/2$ different two-class SVMs on all possible pairs of classes, and then to classify test points according to which class receives the highest number of votes, an approach that is sometimes called one-versus-one. Note that for large $K$ this approach requires significantly more computation than the one-versus-the-rest approach.
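As a sketch of the one-versus-the-rest scheme, the wrapper below trains one SupportVectorClassifier per class and assigns each test point to the class with the largest $y_k(\mathbf{x})$. It assumes two-dimensional inputs (as in the examples above) and that predict returns a pair whose second element contains the real-valued outputs $y(\mathbf{x})$, as used in the contour plots earlier; treat this API as an assumption about the prml package rather than a documented contract.
class OneVersusRestSVC:
    # Toy one-versus-the-rest wrapper around the two-class SupportVectorClassifier.
    def __init__(self, n_classes, theta=1.0, C=1.0):
        self.models = [SupportVectorClassifier(RBF(theta=np.ones(2) * theta), C=C) for _ in range(n_classes)]
    def fit(self, x, t):
        # Train the k-th SVM with class k as +1 and every other class as -1.
        for k, model in enumerate(self.models):
            model.fit(x, np.where(t == k, 1.0, -1.0))
        return self
    def predict(self, x):
        # Assign each point to the class whose classifier gives the largest y_k(x).
        scores = np.stack([model.predict(x)[1] for model in self.models])
        return scores.argmax(axis=0)
For integer labels $t \in \{0, \dots, K-1\}$ this would be used as OneVersusRestSVC(n_classes=K).fit(x, t).predict(x_test).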
In order to extend support vector machines to regression problems and obtain sparse solutions, the quadratic error function is replaced by an ϵ-insensitive error function, which gives zero error if the absolute difference between the prediction y(x) and the target t is less than ϵ where ϵ>0. A simple example of an ϵ-insensitive error function, having a linear cost associated with errors outside the insensitive region, is given by
$$E_\epsilon(y(\mathbf{x}) - t) = \begin{cases} 0 & \text{if } |y(\mathbf{x}) - t| < \epsilon \\ |y(\mathbf{x}) - t| - \epsilon & \text{otherwise} \end{cases}$$
epsilon = 0.25
z = np.linspace(-1, 1, 100)
plt.plot(z, z**2, color="green", label="Square Error")
plt.plot(z, np.where(np.abs(z) < epsilon, 0, np.abs(z) - epsilon), color="red", label="$E_\epsilon$")
plt.legend()
plt.show()
Thus, we can minimize a regularized error function given by,
$$C\sum_{n=1}^N E_\epsilon(y(\mathbf{x}_n) - t_n) + \frac{1}{2}\|\mathbf{w}\|_2^2$$
where $C$ is the (inverse) regularization parameter. For each data point $\mathbf{x}_n$, we introduce two slack variables $\xi_n, \hat{\xi}_n \geq 0$, where $\xi_n > 0$ corresponds to a point for which $t_n > y(\mathbf{x}_n) + \epsilon$, and $\hat{\xi}_n > 0$ corresponds to a point for which $t_n < y(\mathbf{x}_n) - \epsilon$. These slack variables allow points to lie outside the $\epsilon$-tube provided the slack variables are nonzero, and the corresponding conditions are
$$t_n \leq y(\mathbf{x}_n) + \epsilon + \xi_n, \qquad t_n \geq y(\mathbf{x}_n) - \epsilon - \hat{\xi}_n$$
Thus, the error function for support vector regression can then be written as
$$C\sum_{n=1}^N (\xi_n + \hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|_2^2$$
where $\xi_n + \hat{\xi}_n$ measures the amount by which data point $\mathbf{x}_n$ lies outside the $\epsilon$-tube.
epsilon = 0.25
x = np.linspace(-1, 1, 100)
y = np.sin(np.pi * x)
plt.plot(x, y, color="red")
plt.fill_between(
x,
y - epsilon,
y + epsilon,
alpha=0.5,
color="pink",
label="$\epsilon$-tube",
)
plt.text(-0.5, 0.5, "$\\xi > 0$")
plt.text(0.5, -0.5, "$\\hat{\\xi} > 0$")
plt.legend()
plt.show()
The error function can be minimized by introducing Lagrange multipliers $a_n \geq 0$, $\hat{a}_n \geq 0$, $\mu_n \geq 0$, and $\hat{\mu}_n \geq 0$ to obtain,
$$L(\mathbf{w}, b, \mathbf{a}, \hat{\mathbf{a}}) = C\sum_{n=1}^N (\xi_n + \hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|_2^2 - \sum_{n=1}^N (\mu_n \xi_n + \hat{\mu}_n \hat{\xi}_n) - \sum_{n=1}^N a_n(\epsilon + \xi_n + y_n - t_n) - \sum_{n=1}^N \hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n)$$
By setting the derivatives of the Lagrangian with respect to $\mathbf{w}$, $b$, $\xi_n$, and $\hat{\xi}_n$ to zero, we obtain,
$$\begin{aligned}
\frac{\partial L}{\partial \mathbf{w}} = 0 &\Leftrightarrow \mathbf{w} - \frac{\partial}{\partial \mathbf{w}}\left( \sum_{n=1}^N a_n(\epsilon + \xi_n + y_n - t_n) + \sum_{n=1}^N \hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n) \right) = 0 \\
&\Leftrightarrow \mathbf{w} - \frac{\partial}{\partial \mathbf{w}}\left( \sum_{n=1}^N a_n(\epsilon + \xi_n + \mathbf{w}^T\phi(\mathbf{x}_n) + b - t_n) + \sum_{n=1}^N \hat{a}_n(\epsilon + \hat{\xi}_n - \mathbf{w}^T\phi(\mathbf{x}_n) - b + t_n) \right) = 0 \\
&\Leftrightarrow \mathbf{w} - \sum_{n=1}^N a_n \phi(\mathbf{x}_n) + \sum_{n=1}^N \hat{a}_n \phi(\mathbf{x}_n) = 0 \\
&\Leftrightarrow \mathbf{w} = \sum_{n=1}^N (a_n - \hat{a}_n)\phi(\mathbf{x}_n)
\end{aligned}$$
and
$$\begin{aligned}
\frac{\partial L}{\partial b} = 0 &\Leftrightarrow -\frac{\partial}{\partial b}\left( \sum_{n=1}^N a_n(\epsilon + \xi_n + y_n - t_n) + \sum_{n=1}^N \hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n) \right) = 0 \\
&\Leftrightarrow -\frac{\partial}{\partial b}\left( \sum_{n=1}^N a_n(\epsilon + \xi_n + \mathbf{w}^T\phi(\mathbf{x}_n) + b - t_n) + \sum_{n=1}^N \hat{a}_n(\epsilon + \hat{\xi}_n - \mathbf{w}^T\phi(\mathbf{x}_n) - b + t_n) \right) = 0 \\
&\Leftrightarrow -\sum_{n=1}^N a_n + \sum_{n=1}^N \hat{a}_n = 0 \\
&\Leftrightarrow \sum_{n=1}^N (a_n - \hat{a}_n) = 0
\end{aligned}$$
while the derivatives with respect to $\xi_n$ and $\hat{\xi}_n$ are identical to (7.31). Using these results to eliminate the corresponding variables from the Lagrangian, the dual representation is as follows,
$$\begin{aligned}
\tilde{L}(\mathbf{a}, \hat{\mathbf{a}}) &= C\sum_{n=1}^N (\xi_n + \hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|_2^2 - \sum_{n=1}^N (\mu_n \xi_n + \hat{\mu}_n \hat{\xi}_n) - \sum_{n=1}^N a_n(\epsilon + \xi_n + y_n - t_n) - \sum_{n=1}^N \hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n) \\
&= \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{n=1}^N \left[ (C - \mu_n)\xi_n + (C - \hat{\mu}_n)\hat{\xi}_n \right] - \sum_{n=1}^N a_n(\epsilon + \xi_n + y_n - t_n) - \sum_{n=1}^N \hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n) \\
&= \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{n=1}^N a_n \xi_n + \sum_{n=1}^N \hat{a}_n \hat{\xi}_n - \sum_{n=1}^N a_n(\epsilon + \xi_n + y_n - t_n) - \sum_{n=1}^N \hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n) \\
&= \frac{1}{2}\|\mathbf{w}\|_2^2 - \sum_{n=1}^N a_n(\epsilon + y_n - t_n) - \sum_{n=1}^N \hat{a}_n(\epsilon - y_n + t_n) \\
&= \frac{1}{2}\|\mathbf{w}\|_2^2 - \epsilon\sum_{n=1}^N (a_n + \hat{a}_n) - \sum_{n=1}^N (a_n - \hat{a}_n)y_n + \sum_{n=1}^N (a_n - \hat{a}_n)t_n \\
&= \frac{1}{2}\|\mathbf{w}\|_2^2 - \epsilon\sum_{n=1}^N (a_n + \hat{a}_n) - \sum_{n=1}^N (a_n - \hat{a}_n)(\mathbf{w}^T\phi(\mathbf{x}_n) + b) + \sum_{n=1}^N (a_n - \hat{a}_n)t_n \\
&= \frac{1}{2}\|\mathbf{w}\|_2^2 - \epsilon\sum_{n=1}^N (a_n + \hat{a}_n) - \sum_{n=1}^N (a_n - \hat{a}_n)\left( \sum_{m=1}^N (a_m - \hat{a}_m)\phi(\mathbf{x}_m)^T\phi(\mathbf{x}_n) + b \right) + \sum_{n=1}^N (a_n - \hat{a}_n)t_n \\
&= \frac{1}{2}\|\mathbf{w}\|_2^2 - \epsilon\sum_{n=1}^N (a_n + \hat{a}_n) - \sum_{n=1}^N \sum_{m=1}^N (a_n - \hat{a}_n)(a_m - \hat{a}_m)\phi(\mathbf{x}_m)^T\phi(\mathbf{x}_n) - b\sum_{n=1}^N (a_n - \hat{a}_n) + \sum_{n=1}^N (a_n - \hat{a}_n)t_n \\
&= \frac{1}{2}\|\mathbf{w}\|_2^2 - \sum_{n=1}^N \sum_{m=1}^N (a_n - \hat{a}_n)(a_m - \hat{a}_m)\phi(\mathbf{x}_m)^T\phi(\mathbf{x}_n) - \epsilon\sum_{n=1}^N (a_n + \hat{a}_n) + \sum_{n=1}^N (a_n - \hat{a}_n)t_n \\
&= -\frac{1}{2}\sum_{n=1}^N \sum_{m=1}^N (a_n - \hat{a}_n)(a_m - \hat{a}_m)k(\mathbf{x}_m, \mathbf{x}_n) - \epsilon\sum_{n=1}^N (a_n + \hat{a}_n) + \sum_{n=1}^N (a_n - \hat{a}_n)t_n
\end{aligned}$$
which we optimize subject to the following box constraints,
$$0 \leq a_n \leq C, \qquad 0 \leq \hat{a}_n \leq C$$
Substituting (7.57) into (7.1), predictions for new inputs are made using,
$$y(\mathbf{x}) = \sum_{n=1}^N (a_n - \hat{a}_n)k(\mathbf{x}_n, \mathbf{x}) + b$$
The support vectors are those data points that contribute to predictions, in other words those for which either $a_n \neq 0$ or $\hat{a}_n \neq 0$. These are points that lie on the boundary of the $\epsilon$-tube or outside the tube; all points strictly inside the tube have $a_n = \hat{a}_n = 0$. The parameter $b$ can be found by considering a data point for which $0 < a_n < C$ (or, analogously, $0 < \hat{a}_n < C$). Such a point has $\xi_n = 0$ and, from (7.65), must therefore satisfy $\epsilon + y_n - t_n = 0$,
$$\epsilon + y_n - t_n = 0 \Leftrightarrow y_n = t_n - \epsilon \Leftrightarrow \mathbf{w}^T\phi(\mathbf{x}_n) + b = t_n - \epsilon \Leftrightarrow b = t_n - \epsilon - \mathbf{w}^T\phi(\mathbf{x}_n)$$
In practice, it is better to average over all such estimates of $b$.
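As a sanity check of these formulas, the sketch below solves the regression dual for a tiny one-dimensional problem with a generic constrained optimizer (scipy) and an arbitrarily chosen Gaussian kernel; it is only an illustrative sketch, not the solver assumed to be used inside SupportVectorRegressor.
from scipy.optimize import minimize
rng = np.random.default_rng(0)
x_1d = np.linspace(-1, 1, 15)
t_1d = np.sin(np.pi * x_1d) + rng.normal(scale=0.1, size=x_1d.shape)
C, eps = 10.0, 0.1
K = np.exp(-10.0 * (x_1d[:, None] - x_1d[None, :]) ** 2)   # Gaussian kernel matrix
N = len(x_1d)
def neg_dual(v):
    a, a_hat = v[:N], v[N:]
    d = a - a_hat
    return 0.5 * d @ K @ d + eps * (a + a_hat).sum() - d @ t_1d   # negated dual objective -L~(a, a_hat)
res = minimize(
    neg_dual,
    np.zeros(2 * N),
    bounds=[(0.0, C)] * (2 * N),                                   # box constraints 0 <= a_n, a_hat_n <= C
    constraints={"type": "eq", "fun": lambda v: v[:N].sum() - v[N:].sum()},  # sum_n (a_n - a_hat_n) = 0
)
a, a_hat = res.x[:N], res.x[N:]
d = a - a_hat
# b from multipliers strictly inside the box (xi_n = 0), averaged for numerical stability.
free = ((a > 1e-5) & (a < C - 1e-5)) | ((a_hat > 1e-5) & (a_hat < C - 1e-5))
b = np.mean(np.where(a[free] > 1e-5, t_1d[free] - eps, t_1d[free] + eps) - K[free] @ d)
print("support vectors:", int(np.sum(np.abs(d) > 1e-5)), " b =", b.round(3))
Points that end up strictly inside the $\epsilon$-tube get $a_n = \hat{a}_n = 0$ and drop out of the prediction.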
x_space = np.linspace(-1, 1, 100)
t_space = np.sin(np.pi * x_space)
x, t = generate_toy_data(lambda x: np.sin(np.pi * x), 20, 0.25, domain=(-1, 1))
epsilon = 0.3
model = SupportVectorRegressor(kernel=RBF(theta=np.array(10)), epsilon=epsilon)
model.fit(x, t)
print(f"Found {model.n_support_vectors} support vectors.")
plt.scatter(x, t)
plt.plot(x_space, t_space, color="green", label="$\sin(\pi x)$")
plt.fill_between(
x_space,
t_space - epsilon,
t_space + epsilon,
alpha=0.5,
color="pink",
label="$\epsilon$-tube",
)
f = (model._lambda * model._support_labels) @ model.kernel(model._support_vectors, x_space[:, None]) + model.b
plt.plot(x_space, f, color="red", label="SVM")
plt.legend()
plt.show()
SVMs suffer from a number of limitations. In particular, the outputs of an SVM represent decisions rather than posterior probabilities. Also, the SVM is formulated for two classes, and the extension to $K > 2$ classes is problematic. Finally, the complexity parameter $C$, or $\nu$ (and $\epsilon$ in the case of regression), must be found using hold-out methods such as cross-validation. The relevance vector machine (RVM) is a Bayesian sparse kernel technique for regression and classification that is similar to the SVM whilst avoiding its principal limitations. Additionally, it leads to much sparser models, yielding faster performance on test data while maintaining comparable generalization error.
The relevance vector machine for regression is a linear model, similar to the one presented in Chapter 3. The key difference is that we introduce a separate prior hyperparameter $\alpha_i$ for each of the weight parameters $w_i$, instead of a single shared hyperparameter,
$$p(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=1}^M \mathcal{N}(w_i | 0, \alpha_i^{-1})$$
We shall see that, when maximizing the evidence with respect to these hyperparameters, a significant proportion of them go to infinity, and the corresponding weight parameters have posterior distributions that are concentrated at zero. This results in sparse solutions, since the associated basis functions play no role in the predictions and so are effectively pruned out.
Using the result (3.49) for linear regression models, the posterior distribution for the weights is Gaussian and takes the form,
$$p(\mathbf{w}|\mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}, \beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}, \boldsymbol{\Sigma})$$
where the mean and covariance are given by
$$\mathbf{m} = \beta\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}, \qquad \boldsymbol{\Sigma} = (\mathbf{A} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}$$
where $\mathbf{A} = \operatorname{diag}(\alpha_i)$. Note that if $\boldsymbol{\Phi} = \mathbf{K}$, then $\mathbf{K}$ is the symmetric $(N+1)\times(N+1)$ kernel matrix.
The values of α and β are determined using type-2 maximum likelihood, also known as the evidence approximation, in which we maximize the marginal likelihood obtained by integrating out the weight parameters,
$$\begin{aligned}
p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) &= \int p(\mathbf{t}|\mathbf{w}, \mathbf{X}, \beta)\, p(\mathbf{w}|\boldsymbol{\alpha})\, d\mathbf{w} = \int \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w}, \beta^{-1}\mathbf{I})\, \mathcal{N}(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1})\, d\mathbf{w} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \int \exp\left\{ -\frac{\beta}{2}\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|_2^2 \right\} \mathcal{N}(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1})\, d\mathbf{w} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \frac{|\mathbf{A}|^{1/2}}{(2\pi)^{M/2}} \int \exp\left\{ -\frac{\beta}{2}\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|_2^2 \right\} \exp\left\{ -\frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w} \right\} d\mathbf{w} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \frac{1}{(2\pi)^{M/2}} \prod_{m=1}^M \alpha_m^{1/2} \int \exp\left\{ -\frac{\beta}{2}\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|_2^2 \right\} \exp\left\{ -\frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w} \right\} d\mathbf{w} \qquad (\text{$\mathbf{A}$ is diagonal}) \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \frac{1}{(2\pi)^{M/2}} \prod_{m=1}^M \alpha_m^{1/2} \int \exp\left\{ -\frac{1}{2}\left( \beta\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|_2^2 + \mathbf{w}^T\mathbf{A}\mathbf{w} \right) \right\} d\mathbf{w} \qquad (e^a e^b = e^{a+b}) \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \frac{1}{(2\pi)^{M/2}} \prod_{m=1}^M \alpha_m^{1/2} \int \exp\left\{ -\frac{1}{2}E(\mathbf{w}) \right\} d\mathbf{w}
\end{aligned}$$
where
$$\begin{aligned}
E(\mathbf{w}) &= \beta\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|_2^2 + \mathbf{w}^T\mathbf{A}\mathbf{w} \\
&= \beta\mathbf{t}^T\mathbf{t} - 2\beta\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} + \beta\mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} + \mathbf{w}^T\mathbf{A}\mathbf{w} \\
&= \beta\mathbf{t}^T\mathbf{t} - 2\beta\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} + \mathbf{w}^T(\mathbf{A} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})\mathbf{w} \\
&= \beta\mathbf{t}^T\mathbf{t} - 2\beta\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} + \mathbf{w}^T\boldsymbol{\Sigma}^{-1}\mathbf{w} \qquad (\beta\mathbf{t}^T\boldsymbol{\Phi} = \mathbf{m}^T\boldsymbol{\Sigma}^{-1}) \\
&= \beta\mathbf{t}^T\mathbf{t} - 2\mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{w} + \mathbf{w}^T\boldsymbol{\Sigma}^{-1}\mathbf{w} \\
&= \beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{w} - \mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{w} + \mathbf{w}^T\boldsymbol{\Sigma}^{-1}\mathbf{w} \\
&= \beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{w} + (\mathbf{w} - \mathbf{m})^T\boldsymbol{\Sigma}^{-1}\mathbf{w} \\
&= \beta\mathbf{t}^T\mathbf{t} + (\mathbf{w} - \mathbf{m})^T\boldsymbol{\Sigma}^{-1}(\mathbf{w} - \mathbf{m}) - \mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m}
\end{aligned}$$
Therefore, substituting $E(\mathbf{w})$ back into the integral we obtain,
$$\begin{aligned}
p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) &= \left(\frac{\beta}{2\pi}\right)^{N/2} \frac{1}{(2\pi)^{M/2}} \prod_{m=1}^M \alpha_m^{1/2} \int \exp\left\{ -\frac{1}{2}E(\mathbf{w}) \right\} d\mathbf{w} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \frac{1}{(2\pi)^{M/2}} \prod_{m=1}^M \alpha_m^{1/2} \int \exp\left\{ -\frac{1}{2}\left( \beta\mathbf{t}^T\mathbf{t} + (\mathbf{w}-\mathbf{m})^T\boldsymbol{\Sigma}^{-1}(\mathbf{w}-\mathbf{m}) - \mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m} \right) \right\} d\mathbf{w} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \frac{1}{(2\pi)^{M/2}} \prod_{m=1}^M \alpha_m^{1/2} \exp\left\{ \frac{1}{2}\mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m} - \frac{\beta}{2}\mathbf{t}^T\mathbf{t} \right\} \int \exp\left\{ -\frac{1}{2}(\mathbf{w}-\mathbf{m})^T\boldsymbol{\Sigma}^{-1}(\mathbf{w}-\mathbf{m}) \right\} d\mathbf{w} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \frac{1}{(2\pi)^{M/2}} \prod_{m=1}^M \alpha_m^{1/2}\, (2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2} \exp\left\{ \frac{1}{2}\mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m} - \frac{\beta}{2}\mathbf{t}^T\mathbf{t} \right\} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \prod_{m=1}^M \alpha_m^{1/2}\, |\boldsymbol{\Sigma}|^{1/2} \exp\left\{ \frac{1}{2}\mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m} - \frac{\beta}{2}\mathbf{t}^T\mathbf{t} \right\} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \prod_{m=1}^M \alpha_m^{1/2}\, |\boldsymbol{\Sigma}|^{1/2} \exp\left\{ -\frac{1}{2}E(\mathbf{t}) \right\}
\end{aligned}$$
where
$$\begin{aligned}
E(\mathbf{t}) &= \beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m} \\
&= \beta\mathbf{t}^T\mathbf{t} - \beta^2\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t} \\
&= \beta\mathbf{t}^T\mathbf{t} - \beta^2\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t} \\
&= \mathbf{t}^T\left( \beta\mathbf{I} - \beta^2\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T \right)\mathbf{t} \\
&= \mathbf{t}^T\left( \beta\mathbf{I} - \beta\boldsymbol{\Phi}(\mathbf{A} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\beta \right)\mathbf{t} \qquad (\text{Woodbury identity (C.7)}) \\
&= \mathbf{t}^T\left( \beta^{-1}\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T \right)^{-1}\mathbf{t}
\end{aligned}$$
and for the Woodbury identity (C.7) we have used the following mapping,
$$\mathbf{A} \to \beta^{-1}\mathbf{I} \ (\text{so that } \mathbf{A}^{-1} = \beta\mathbf{I}), \qquad \mathbf{B} \to \boldsymbol{\Phi}, \qquad \mathbf{C} \to \boldsymbol{\Phi}^T, \qquad \mathbf{D} \to \mathbf{A}$$
Therefore,
$$\begin{aligned}
p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) &= \left(\frac{\beta}{2\pi}\right)^{N/2} \prod_{m=1}^M \alpha_m^{1/2}\, |\boldsymbol{\Sigma}|^{1/2} \exp\left\{ -\frac{1}{2}\mathbf{t}^T\left( \beta^{-1}\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T \right)^{-1}\mathbf{t} \right\} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \prod_{m=1}^M \alpha_m^{1/2}\, |\boldsymbol{\Sigma}|^{1/2} \exp\left\{ -\frac{1}{2}\mathbf{t}^T\mathbf{C}^{-1}\mathbf{t} \right\} = \mathcal{N}(\mathbf{t}|\mathbf{0}, \mathbf{C})
\end{aligned}$$
where we have defined the $N \times N$ matrix $\mathbf{C}$ given by
$$\mathbf{C} = \beta^{-1}\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T$$
Since this represents the convolution of two Gaussians, it can be evaluated to give the log marginal likelihood in the form,
$$\ln p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) = \ln \mathcal{N}(\mathbf{t}|\mathbf{0}, \mathbf{C}) = -\frac{1}{2}\left\{ N\ln(2\pi) + \ln|\mathbf{C}| + \mathbf{t}^T\mathbf{C}^{-1}\mathbf{t} \right\}$$
In order to maximize the resulting marginal likelihood with respect to the hyperparameters $\boldsymbol{\alpha}$ and $\beta$, we simply set the derivatives of the marginal likelihood to zero and obtain the following re-estimation equations,
$$\alpha_i^{new} = \frac{1 - \alpha_i\Sigma_{ii}}{m_i^2}, \qquad (\beta^{new})^{-1} = \frac{\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}\|_2^2}{N - \sum_i(1 - \alpha_i\Sigma_{ii})}$$
Learning therefore proceeds by choosing initial values for $\boldsymbol{\alpha}$ and $\beta$, evaluating the mean and covariance of the posterior, and then alternately re-estimating the hyperparameters and re-estimating the posterior mean and covariance, until a suitable convergence criterion is satisfied.
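The re-estimation loop is short enough to sketch directly in NumPy. The design matrix below (a bias column plus one Gaussian basis function per training point, with an arbitrary length-scale) and the hyperparameter cap standing in for infinity are assumptions of this sketch, not necessarily what RelevanceVectorRegressor does internally.
def rvm_regression(x, t, n_iter=1000, alpha_cap=1e9):
    # Sketch of RVM regression trained by evidence (type-2 maximum likelihood) re-estimation.
    # Design matrix: a bias column plus one Gaussian basis function per training point.
    Phi = np.hstack([np.ones((len(x), 1)), np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2)])
    N, M = Phi.shape
    alpha, beta = np.ones(M), 1.0
    for _ in range(n_iter):
        # Posterior mean and covariance of the weights for the current hyperparameters.
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        m = beta * Sigma @ Phi.T @ t
        # Re-estimation equations for alpha and beta.
        gamma = 1 - alpha * np.diag(Sigma)
        alpha_new = np.clip(gamma / (m ** 2 + 1e-12), 1e-12, alpha_cap)
        beta_new = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
        if np.allclose(alpha_new, alpha) and np.isclose(beta_new, beta):
            break
        alpha, beta = alpha_new, beta_new
    return m, Sigma, alpha, beta, alpha < alpha_cap
x_demo = np.linspace(-1, 1, 30)
t_demo = np.sin(np.pi * x_demo) + np.random.normal(scale=0.2, size=x_demo.shape)
m, Sigma, alpha, beta, kept = rvm_regression(x_demo, t_demo)
print("retained basis functions (incl. bias):", int(kept.sum()))
A proportion of the $\alpha_i$ hit the cap (standing in for infinity) and the corresponding basis functions are pruned; the surviving columns correspond to the relevance vectors.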
As a result of the optimization, a proportion of the hyperparameters αi are driven to large (in principle infinite) values, and so the corresponding weight parameters wi have posterior distributions with mean and variance both zero. Thus those parameters, and the corresponding basis functions, are removed from the model and play no role in making predictions. In the case of the dual representation, the inputs xn corresponding to the remaining non-zero weights are called relevance vectors, because they are identified through the mechanism of automatic relevance determination, and are analogous to the support vectors of an SVM.
Having found values $\boldsymbol{\alpha}^\star$ and $\beta^\star$ for the hyperparameters that maximize the marginal likelihood, we can evaluate the predictive distribution over $t$ for a new input $\mathbf{x}_{new}$,
$$p(t|\mathbf{x}_{new}, \mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^\star, \beta^\star) = \int p(t|\mathbf{x}_{new}, \mathbf{w}, \beta^\star)\, p(\mathbf{w}|\mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^\star, \beta^\star)\, d\mathbf{w} = \mathcal{N}(t|\mathbf{m}^T\phi(\mathbf{x}_{new}), \sigma^2(\mathbf{x}_{new}))$$
Similar to Chapter 3, the result of the integral is a Gaussian distribution, and the variance of the predictive distribution is given by,
$$\sigma^2(\mathbf{x}_{new}) = \frac{1}{\beta^\star} + \phi(\mathbf{x}_{new})^T\boldsymbol{\Sigma}\,\phi(\mathbf{x}_{new})$$
The following example applies an RVM to the sinusoidal regression data set:
x_space = np.linspace(-1, 1, 100)
t_space = np.sin(np.pi * x_space)
x, t = generate_toy_data(lambda x: np.sin(np.pi * x), 20, 0.2, domain=(-1, 1))
model = RelevanceVectorRegressor(kernel=RBF(theta=np.array([15.0])))
model.fit(x, t)
print(f"Found {model.n_relevance_vectors} relevance vectors.")
mean, variance = model.predict(x_space)
plt.scatter(x, t, facecolor="none", edgecolor="green", label="training vector")
plt.scatter(
model.relevance_vectors, model.relevance_labels, s=100, facecolor="none", edgecolor="blue", label="relevance vector"
)
plt.plot(x_space, t_space, color="green", label="$\sin(\pi x)$")
plt.plot(x_space, mean, color="red", label="RVM predicted mean")
plt.fill_between(x_space, mean - variance, mean + variance, alpha=0.25, color="pink", label="predicted $\sigma(x)$")
plt.legend()
plt.show()
Found 4 relevance vectors.
For many regression and classification tasks, the RVM is found to give models that are typically an order of magnitude more compact than the corresponding support vector machine, resulting in a significant improvement in the speed of processing on test data. Remarkably, the greater sparsity is achieved with little or no reduction in generalization error. The principal disadvantage of the RVM compared to the SVM is that training involves optimizing a non-convex function, and training times can be longer than for a comparable SVM.
For a model having $M$ basis functions, the RVM requires inversion of a matrix of size $M \times M$, which in general requires $O(M^3)$ computation. In the specific case of the kernel-based RVM model, we have $M = N + 1$. Moreover, there are techniques for training SVMs whose cost is roughly quadratic in $N$. More significantly, in the relevance vector machine the parameters governing complexity and noise variance are determined automatically from a single training run, whereas in the support vector machine the parameters $C$ and $\epsilon$ (or $\nu$) are generally found using cross-validation.
The relevance vector machine framework can be extended to classification problems by applying the automatic relevance determination (ARD) prior over the weights to a probabilistic linear classification model, such as the one presented in Chapter 4,
$$y(\mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T\phi(\mathbf{x}))$$
To that end, we introduce a Gaussian prior over the weight vector $\mathbf{w}$, with a separate precision hyperparameter $\alpha_i$ for each weight parameter. However, unlike the regression case, we can no longer integrate analytically over the parameter vector, and therefore, as in Bayesian logistic regression, we employ the Laplace approximation. For a given value of $\boldsymbol{\alpha}$, we build a Gaussian approximation to the posterior distribution of the parameters and thereby obtain an approximation to the marginal likelihood. Maximization of this approximate marginal likelihood then gives re-estimated values for $\boldsymbol{\alpha}$.
The posterior distribution over the parameters is obtained by,
$$p(\mathbf{w}|\mathbf{t}, \mathbf{X}, \boldsymbol{\alpha}) \propto p(\mathbf{t}|\mathbf{w}, \mathbf{X})\, \mathcal{N}(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1})$$
Then, the mode of the posterior distribution is obtained by maximizing,
$$\begin{aligned}
\ln p(\mathbf{w}|\mathbf{t}, \mathbf{X}, \boldsymbol{\alpha}) &= \ln p(\mathbf{t}|\mathbf{w}, \mathbf{X}) + \ln \mathcal{N}(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1}) \\
&= \ln\left( \prod_{n=1}^N y_n^{t_n}(1 - y_n)^{1 - t_n} \right) + \ln\left( \frac{1}{(2\pi)^{M/2}|\mathbf{A}^{-1}|^{1/2}} \exp\left\{ -\frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w} \right\} \right) \\
&= \sum_{n=1}^N \left\{ t_n\ln y_n + (1 - t_n)\ln(1 - y_n) \right\} - \frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w} + \ln\left( \frac{1}{(2\pi)^{M/2}|\mathbf{A}^{-1}|^{1/2}} \right) \\
&= \sum_{n=1}^N \left\{ t_n\ln y_n + (1 - t_n)\ln(1 - y_n) \right\} - \frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w} - \frac{M}{2}\ln(2\pi) - \frac{1}{2}\ln|\mathbf{A}^{-1}|
\end{aligned}$$
where $\mathbf{A} = \operatorname{diag}(\alpha_i)$. The maximization can be achieved using iterative reweighted least squares (IRLS). The gradient vector and Hessian matrix of the log posterior distribution are given by,
$$\begin{aligned}
\nabla_{\mathbf{w}} \ln p(\mathbf{w}|\mathbf{t}, \mathbf{X}, \boldsymbol{\alpha}) &= \nabla_{\mathbf{w}}\left[ \sum_{n=1}^N \left\{ t_n\ln y_n + (1 - t_n)\ln(1 - y_n) \right\} - \frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w} \right] \\
&= \sum_{n=1}^N \nabla_{\mathbf{w}}\left\{ t_n\ln y_n + (1 - t_n)\ln(1 - y_n) \right\} - \mathbf{A}\mathbf{w} \\
&\overset{(4.88)}{=} \sum_{n=1}^N \left\{ t_n\frac{1}{y_n}y_n(1 - y_n)\phi_n - (1 - t_n)\frac{1}{1 - y_n}y_n(1 - y_n)\phi_n \right\} - \mathbf{A}\mathbf{w} \\
&= \sum_{n=1}^N \left\{ t_n(1 - y_n)\phi_n - (1 - t_n)y_n\phi_n \right\} - \mathbf{A}\mathbf{w} \\
&= \sum_{n=1}^N \left\{ t_n\phi_n - t_ny_n\phi_n - y_n\phi_n + t_ny_n\phi_n \right\} - \mathbf{A}\mathbf{w} \\
&= \sum_{n=1}^N (t_n - y_n)\phi_n - \mathbf{A}\mathbf{w} = \boldsymbol{\Phi}^T(\mathbf{t} - \mathbf{y}) - \mathbf{A}\mathbf{w}
\end{aligned}$$
and
$$\begin{aligned}
\nabla\nabla_{\mathbf{w}} \ln p(\mathbf{w}|\mathbf{t}, \mathbf{X}, \boldsymbol{\alpha}) &= \nabla_{\mathbf{w}}\left[ \sum_{n=1}^N (t_n - y_n)\phi_n - \mathbf{A}\mathbf{w} \right] \\
&= \sum_{n=1}^N \nabla_{\mathbf{w}}\left\{ (t_n - y_n)\phi_n \right\} - \mathbf{A} \\
&= -\sum_{n=1}^N (\nabla_{\mathbf{w}}\,y_n)\,\phi_n^T - \mathbf{A} \\
&\overset{(4.88)}{=} -\sum_{n=1}^N y_n(1 - y_n)\phi_n\phi_n^T - \mathbf{A} \\
&= -(\boldsymbol{\Phi}^T\mathbf{B}\boldsymbol{\Phi} + \mathbf{A})
\end{aligned}$$
where $\mathbf{B}$ is a diagonal matrix with elements $b_n = y_n(1 - y_n)$.
The elements of the negative Hessian can be interpreted as the inverse covariance matrix for the Gaussian approximation to the posterior distribution. The mode of the resulting approximation to the posterior distribution is obtained by setting the gradient vector to zero, giving the mean and covariance of the Laplace approximation in the form,
$$\mathbf{w}^\star = \mathbf{A}^{-1}\boldsymbol{\Phi}^T(\mathbf{t} - \mathbf{y}), \qquad \boldsymbol{\Sigma} = (\boldsymbol{\Phi}^T\mathbf{B}\boldsymbol{\Phi} + \mathbf{A})^{-1}$$
Then, we can use the Laplace approximation of the posterior to evaluate the marginal likelihood. Using the general result (4.135) for an integral evaluated using the Laplace approximation, we have
$$\begin{aligned}
p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}) &= \int p(\mathbf{t}|\mathbf{w}, \mathbf{X})\, p(\mathbf{w}|\boldsymbol{\alpha})\, d\mathbf{w} \\
&\approx p(\mathbf{t}|\mathbf{w}^\star, \mathbf{X})\, p(\mathbf{w}^\star|\boldsymbol{\alpha})\, (2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2} \\
&= \prod_{n=1}^N p(t_n|\mathbf{w}^\star, \mathbf{x}_n) \prod_{i=1}^M \mathcal{N}(w_i^\star|0, \alpha_i^{-1})\, (2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2} \\
&= \prod_{n=1}^N p(t_n|\mathbf{w}^\star, \mathbf{x}_n)\, \mathcal{N}(\mathbf{w}^\star|\mathbf{0}, \mathbf{A}^{-1})\, (2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2}
\end{aligned}$$
Taking the logarithm of the marginal likelihood, we obtain,
$$\begin{aligned}
\ln p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}) &= \ln\left( \prod_{n=1}^N y_n^{t_n}(1 - y_n)^{1 - t_n} \right) + \ln \mathcal{N}(\mathbf{w}^\star|\mathbf{0}, \mathbf{A}^{-1}) + \ln\left( (2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2} \right) \\
&= \ln\left( \prod_{n=1}^N y_n^{t_n}(1 - y_n)^{1 - t_n} \right) - \frac{1}{2}(\mathbf{w}^\star)^T\mathbf{A}\mathbf{w}^\star + \frac{1}{2}\ln|\mathbf{A}| - \frac{M}{2}\ln(2\pi) + \frac{M}{2}\ln(2\pi) + \frac{1}{2}\ln|\boldsymbol{\Sigma}| \\
&= \ln\left( \prod_{n=1}^N y_n^{t_n}(1 - y_n)^{1 - t_n} \right) - \frac{1}{2}(\mathbf{w}^\star)^T\mathbf{A}\mathbf{w}^\star + \frac{1}{2}\ln|\mathbf{A}| + \frac{1}{2}\ln|\boldsymbol{\Sigma}| \\
&= \sum_{n=1}^N \left\{ t_n\ln y_n + (1 - t_n)\ln(1 - y_n) \right\} - \frac{1}{2}(\mathbf{w}^\star)^T\mathbf{A}\mathbf{w}^\star + \frac{1}{2}\ln|\mathbf{A}| + \frac{1}{2}\ln|\boldsymbol{\Sigma}|
\end{aligned}$$
Then, setting the derivative of the marginal likelihood with respect to $\alpha_i$ equal to zero, we obtain,
$$\begin{aligned}
\frac{\partial \ln p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha})}{\partial \alpha_i} = 0 
&\Leftrightarrow \frac{\partial}{\partial \alpha_i}\sum_{n=1}^N \left\{ t_n\ln y_n + (1 - t_n)\ln(1 - y_n) \right\} - \frac{1}{2}\frac{\partial}{\partial \alpha_i}(\mathbf{w}^\star)^T\mathbf{A}\mathbf{w}^\star + \frac{1}{2}\frac{\partial}{\partial \alpha_i}\ln|\mathbf{A}| + \frac{1}{2}\frac{\partial}{\partial \alpha_i}\ln|\boldsymbol{\Sigma}| = 0 \\
&\Leftrightarrow -\frac{1}{2}\frac{\partial}{\partial \alpha_i}(\mathbf{w}^\star)^T\mathbf{A}\mathbf{w}^\star + \frac{1}{2}\frac{\partial}{\partial \alpha_i}\ln|\mathbf{A}| + \frac{1}{2}\frac{\partial}{\partial \alpha_i}\ln|\boldsymbol{\Sigma}| = 0 \\
&\Leftrightarrow -\frac{1}{2}(w_i^\star)^2 + \frac{1}{2\alpha_i} - \frac{1}{2}\Sigma_{ii} = 0 \\
&\Leftrightarrow -(w_i^\star)^2 + \frac{1}{\alpha_i} - \Sigma_{ii} = 0 \\
&\Leftrightarrow -\alpha_i(w_i^\star)^2 + 1 - \alpha_i\Sigma_{ii} = 0 \\
&\Leftrightarrow 1 - \alpha_i\Sigma_{ii} = \alpha_i(w_i^\star)^2 \\
&\Leftrightarrow \alpha_i^{new} = \frac{1 - \alpha_i\Sigma_{ii}}{(w_i^\star)^2}
\end{aligned}$$
which is identical to the re-estimation formula obtained for the regression case.
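Putting the pieces together, the following sketch implements the classification training loop: an inner Newton/IRLS iteration finds the Laplace mode $\mathbf{w}^\star$ and covariance $\boldsymbol{\Sigma}$ for fixed $\boldsymbol{\alpha}$, and an outer loop applies the $\alpha_i$ re-estimation above. The Gaussian-kernel design matrix with a bias column, the hyperparameter cap, and the synthetic two-blob data are assumptions of this sketch, not the internals of RelevanceVectorClassifier.
def rvm_classification(X, t, n_outer=100, n_inner=50, alpha_cap=1e9):
    # Sketch of RVM classification: Laplace approximation (inner IRLS) + alpha re-estimation (outer loop).
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Phi = np.hstack([np.ones((len(X), 1)), np.exp(-sq_dist)])   # bias column + Gaussian basis functions
    M = Phi.shape[1]
    alpha, w = np.ones(M), np.zeros(M)
    sigmoid = lambda z: 1 / (1 + np.exp(-np.clip(z, -30, 30)))
    for _ in range(n_outer):
        A = np.diag(alpha)
        for _ in range(n_inner):
            y = sigmoid(Phi @ w)
            grad = Phi.T @ (t - y) - alpha * w                  # gradient of the log posterior
            H = Phi.T @ ((y * (1 - y))[:, None] * Phi) + A      # negative Hessian, Phi^T B Phi + A
            step = np.linalg.solve(H, grad)                     # Newton/IRLS step towards the mode w*
            w = w + step
            if np.abs(step).max() < 1e-6:
                break
        y = sigmoid(Phi @ w)
        Sigma = np.linalg.inv(Phi.T @ ((y * (1 - y))[:, None] * Phi) + A)
        # alpha_i <- (1 - alpha_i * Sigma_ii) / (w_i*)^2, capped so that pruned weights stay finite.
        alpha = np.clip((1 - alpha * np.diag(Sigma)) / (w ** 2 + 1e-12), 1e-12, alpha_cap)
    return w, Sigma, alpha, alpha < alpha_cap
X_demo = np.random.normal(size=(60, 2)) + np.vstack([np.full((30, 2), 2.0), np.full((30, 2), -2.0)])
t_demo = np.concatenate([np.ones(30), np.zeros(30)])
w_star, Sigma, alpha, kept = rvm_classification(X_demo, t_demo)
print("retained basis functions (incl. bias):", int(kept.sum()))
Typically most of the $\alpha_i$ diverge (hit the cap), and the few surviving basis functions play the role of relevance vectors.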
# number of training points
N = 100
x, t = make_classification(
n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, n_samples=N, random_state=21
)
x1, x2 = np.meshgrid(np.linspace(-5, 5, N), np.linspace(-5, 5, N))
x_test = np.array([x1, x2]).reshape(2, -1).T
model = RelevanceVectorClassifier(kernel=RBF(theta=np.array([1, 1])))
model.fit(x, t, n_iter=100)
print(f"Found {model.n_relevance_vectors} relevance vectors.")
predicted = model.predict(x_test)
plt.scatter(x[:, 0], x[:, 1], c=t)
plt.scatter(
model.relevance_vectors[:, 0],
model.relevance_vectors[:, 1],
s=100,
facecolor="none",
edgecolor="green",
label="relevance vector",
)
plt.contourf(x1, x2, predicted.reshape(N, N), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.legend()
plt.show()
Found 6 relevance vectors.
Note that the relevance vectors do not lie in the region of the decision boundary, in contrast to the support vector machine. This is consistent with the mechanism that drives sparsity in the RVM, because a basis function $\phi_i(\mathbf{x})$ centered on a data point near the boundary has a corresponding column in the design matrix that is poorly aligned with the training data vector $\mathbf{t}$.