import math
import numpy as np
import matplotlib.pyplot as plt
from prml.datasets import generate_toy_data
from prml.distribution import Gaussian, MultivariateGaussian
from prml.preprocessing import PolynomialFeature, GaussianFeature, SigmoidFeature
from prml.linear import LinearRegression, RidgeRegression, BayesianRegression, EvidenceApproximation
# Set random seed to make deterministic
np.random.seed(0)
# Ignore zero divisions and computation involving NaN values.
np.seterr(divide="ignore", invalid="ignore")
# Enable higher resolution plots
%config InlineBackend.figure_format = 'retina'
The goal of regression is to predict the value of one or more continuous target variables $t$ given the value of a $D$-dimensional vector $\mathbf{x}$ of input variables. The polynomial curve that we used in Chapter 1 belongs to a broader class of functions called linear regression models, which are linear functions of the adjustable parameters. The simplest form of linear regression model is also a linear function of the input variables $\mathbf{x}$. However, a much more useful class of functions consists of linear combinations of a fixed set of nonlinear functions of the input variables, known as basis functions.
The simplest linear model for regression is one that involves a linear combination of the input variables
$$y(\mathbf{x},\mathbf{w}) = w_0 + w_1x_1 + \dots + w_Dx_D$$

which is simply known as linear regression. This model is a linear function of the parameters, but also a linear function of the input variables, which imposes significant limitations. We therefore extend the class of models by considering linear combinations of fixed nonlinear functions of the input variables, of the form

$$y(\mathbf{x},\mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j\phi_j(\mathbf{x}) = \sum_{j=0}^{M-1} w_j\phi_j(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})$$

where the $\phi_j(\mathbf{x})$ are known as basis functions and $\phi_0(\mathbf{x}) = 1$ is a dummy basis function corresponding to the bias $w_0$.

The polynomial regression considered in Chapter 1 is an example of this model, in which there is a single input variable $x$ and the basis functions take the form of powers of $x$, so that $\phi_j(x) = x^j$. There are many other possible choices for the basis functions. For instance, functions of the form

$$\phi_j(x) = \exp\left\{-\frac{(x-\mu_j)^2}{2s^2}\right\}$$

are referred to as Gaussian basis functions, where the $\mu_j$ govern the locations of the functions in input space and $s$ governs their spatial scale. Another possibility is the sigmoidal basis function of the form

$$\phi_j(x) = \sigma\left(\frac{x-\mu_j}{s}\right)$$

where $\sigma(a)$ is the logistic sigmoid function defined by

$$\sigma(a) = \frac{1}{1+\exp(-a)}$$

Below we present examples of these three families of basis functions using different parameters.
x_space = np.linspace(-1, 1, 100)
# Create polynomial basis functions up to degree 12
x_polynomial = PolynomialFeature(degree=12).transform(x_space)
# Create 12 Gaussian basis functions
x_gaussian = GaussianFeature(mean=np.linspace(-1, 1, 12), sigma=0.1).transform(x_space)
# Create 12 sigmoid basis functions
x_sigmoid = SigmoidFeature(mean=np.linspace(-1, 1, 12), sigma=0.1).transform(x_space)
plt.figure(figsize=(20, 5))
for i, x in enumerate([x_polynomial, x_gaussian, x_sigmoid]):
    plt.subplot(1, 3, i + 1)
    for j in range(x.shape[1]):
        plt.plot(x_space, x[:, j])
In Chapter 1, we fitted polynomial functions to data sets by minimizing a sum-of-squares error function. We also showed that this error function could be motivated as the maximum likelihood solution under an assumed Gaussian noise model. Let us return to the discussion of Chapter 1 and consider the least squares approach, and its relation to maximum likelihood, in more detail.
As before, we assume that the target variable t is given by a deterministic function y(x,w) having additive Gaussian noise so that
$$t = y(\mathbf{x},\mathbf{w}) + \epsilon$$

where $\epsilon$ is a zero-mean Gaussian random variable with precision (inverse variance) $\beta$. Thus we can write

$$p(t|\mathbf{x},\mathbf{w},\beta) = \mathcal{N}(t|y(\mathbf{x},\mathbf{w}),\beta^{-1})$$

Note that the Gaussian noise assumption implies that the conditional distribution of $t$ given $\mathbf{x}$ is unimodal, which may be inappropriate for some applications.

Now consider a data set of inputs $\mathbf{X} = \{\mathbf{x}_1,\dots,\mathbf{x}_N\}$ along with corresponding target values $\mathbf{t} = (t_1,\dots,t_N)^T$. Assuming that the data points are drawn i.i.d., we obtain the likelihood function (a function of the adjustable parameters $\mathbf{w}$ and $\beta$) in the form

$$p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^{N}\mathcal{N}(t_n|y(\mathbf{x}_n,\mathbf{w}),\beta^{-1}) = \prod_{n=1}^{N}\mathcal{N}(t_n|\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n),\beta^{-1})$$

NOTE: In many textbooks, the input variables $\mathbf{x}$ are dropped from the set of conditioning variables, since we do not seek to model the distribution of $\mathbf{x}$.
Taking the logarithm of the likelihood function, we have,
$$\ln p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \sum_{n=1}^{N}\ln\mathcal{N}(t_n|\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n),\beta^{-1}) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi - \frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi - \beta E_D(\mathbf{w})$$

By maximizing the likelihood we can determine the parameters $\mathbf{w}$ and $\beta$. As already observed in Chapter 1, maximization under a conditional Gaussian noise distribution is equivalent to minimizing the sum-of-squares error function given by $E_D(\mathbf{w})$. The gradient of the log likelihood function takes the form

$$\nabla\ln p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \beta\sum_{n=1}^{N}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)\boldsymbol{\phi}(\mathbf{x}_n)^T$$

Setting this gradient to zero and solving for $\mathbf{w}$ gives

$$\mathbf{w}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

which are known as the normal equations for the least squares problem. Here $\boldsymbol{\Phi}$ is an $N\times M$ matrix, called the design matrix, whose elements are given by $\Phi_{nj} = \phi_j(\mathbf{x}_n)$, so that

$$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$$

By maximizing the log likelihood function over the noise precision parameter $\beta$, we obtain

$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\left(t_n - \mathbf{w}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2$$

# Create polynomial, gaussian and sigmoid basis functions
polynomial_basis = PolynomialFeature(degree=10)
gaussian_basis = GaussianFeature(mean=np.arange(-9, 9, 0.5), sigma=1)
sigmoid_basis = SigmoidFeature(mean=np.arange(-9, 9, 0.5), sigma=1)
# Let's train a linear regression model on a couple of datasets
model = LinearRegression()
plt.figure(figsize=(20, 5))
X = np.arange(-10, 10, 0.5).reshape((-1, 1))
y = np.sin(X) + np.random.randn(X.shape[0], X.shape[1]) * 0.3
for i, basis in enumerate([polynomial_basis, gaussian_basis, sigmoid_basis]):
    plt.subplot(2, 3, i + 1)
    plt.scatter(X, y)
    x_train_features = basis.transform(X)
    model.fit(x_train_features, y)
    plt.plot(X, model.predict(x_train_features)[0], color="red")
    plt.title(["Polynomial Basis", "Gaussian Basis", "Sigmoid Basis"][i], fontsize=14)
X = np.arange(-10, 10, 0.5)
y = 0.8 * abs(X) ** 0.3 + 2 + np.random.randn(X.shape[0]) * 0.1
for i, basis in enumerate([polynomial_basis, gaussian_basis, sigmoid_basis]):
    plt.subplot(2, 3, i + 4)
    plt.scatter(X, y)
    x_train_features = basis.transform(X)
    model.fit(x_train_features, y)
    plt.plot(X, model.predict(x_train_features)[0], color="red")
plt.show()
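For reference, the normal equations above can also be evaluated directly with NumPy, without the prml helpers. The following is a minimal sketch on hypothetical toy data (the sinusoid, the polynomial degree, and the use of `np.linalg.lstsq` to solve the least squares problem are illustrative choices, not the prml implementation):

import numpy as np

# Toy data: noisy sinusoid (illustrative choice)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

# Design matrix for polynomial basis functions phi_j(x) = x^j, j = 0..M-1
M = 6
Phi = np.vander(x, M, increasing=True)  # shape (N, M), first column is phi_0 = 1

# Normal equations: w_ML = (Phi^T Phi)^{-1} Phi^T t
# (np.linalg.lstsq solves the same least squares problem in a numerically stable way)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Maximum likelihood noise precision: 1/beta_ML = (1/N) sum_n (t_n - w_ML^T phi(x_n))^2
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals**2)
print(w_ml, beta_ml)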
Batch techniques, such as the maximum likelihood solution above, process the entire training set at once, which can be computationally costly for large data sets. Sequential (online) learning algorithms instead consider data points one at a time and update the model parameters after processing each point. We can obtain a sequential learning algorithm by applying a technique known as stochastic gradient descent, also called sequential gradient descent.
If an error function comprises a sum over data points, then for a data point n, the stochastic gradient descent updates the parameter vector w as follows,
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\nabla E_n$$

where $\tau$ is the iteration number and $\eta$ is a learning rate parameter. The value of $\mathbf{w}^{(0)}$ can be initialized to some random vector. For the case of the sum-of-squares error function, this gives

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\left(\mathbf{w}^{(\tau)T}\boldsymbol{\phi}_n - t_n\right)\boldsymbol{\phi}_n$$

where $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$. This is known as the least-mean-squares (LMS) algorithm.
# Generate an example dataset
sin = lambda x: np.sin(2 * np.pi * x)
x_train, y_train = generate_toy_data(sin, sample_size=100, std=0.3)
x_test, y_test = generate_toy_data(sin, sample_size=100, std=0.3)
# Create polynomial features
polynomial = PolynomialFeature(degree=5)
x_train_features = polynomial.transform(x_train)
x_test_features = polynomial.transform(x_test)
# Fit linear regression using both gradient descent and least squares
model = LinearRegression()
model.fit_lms(x_train_features, y_train, eta=0.1)
y_pred_lms, _ = model.predict(x_test_features)
model.fit(x_train_features, y_train)
y_pred_ls, _ = model.predict(x_test_features)
plt.scatter(x_test, y_test, facecolor="none", edgecolor="b", s=50, label="Test data")
plt.plot(x_test, y_pred_lms, color="r", label="LMS (stochastic gradient descent)")
plt.plot(x_test, y_pred_ls, color="g", label="Least squares solution")
plt.xlabel("$x$", fontsize=14)
plt.ylabel("$y$", fontsize=14)
plt.legend(bbox_to_anchor=(1, 0.7), loc=2, borderaxespad=1, fontsize=14)
plt.show()
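The `fit_lms` call above hides the update rule; for reference, here is a minimal pure-NumPy sketch of the sequential LMS update itself (the toy data, polynomial degree, learning rate, and number of passes are all illustrative assumptions, not the prml implementation):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

# Polynomial design matrix phi_j(x) = x^j
Phi = np.vander(x, 6, increasing=True)  # (N, M)

# Sequential (stochastic) gradient descent / LMS
eta = 0.1
w = np.zeros(Phi.shape[1])
for _ in range(100):  # several passes over the data
    for phi_n, t_n in zip(Phi, t):
        # w <- w - eta * (w^T phi_n - t_n) * phi_n
        w -= eta * (w @ phi_n - t_n) * phi_n
print(w)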
In Chapter 1, we introduced the idea of adding a regularization term to the error function in order to control over-fitting. To that end, the total error function to be minimized takes the form,
$$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$

where $\lambda$ is the regularization coefficient that controls the balance between the data-dependent error $E_D(\mathbf{w})$ and the regularization term over the parameters $E_W(\mathbf{w})$. One of the simplest regularizers we can employ is given by the sum-of-squares of the weight vector

$$E_W(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}$$

Combining the sum-of-squares error function for the data and the quadratic regularizer leads to the following total error function

$$\frac{1}{2}\sum_{n=1}^{N}\left\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$$

This particular regularizer is also known as weight decay, since it encourages weight values to decay towards zero unless supported by the data. In statistics, it provides an example of a parameter shrinkage method. Moreover, it has the advantage that the error function remains quadratic in $\mathbf{w}$, so it can be minimized in closed form.
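For this quadratic regularizer, the closed-form minimizer is $\mathbf{w} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$; a minimal NumPy sketch, with toy data and a $\lambda$ value chosen purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

Phi = np.vander(x, 10, increasing=True)  # polynomial design matrix
M = Phi.shape[1]
lam = 1e-3  # regularization coefficient lambda (illustrative)

# Regularized least squares: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_ridge)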
A more general formulation for the regularized error is given by
$$E_W(\mathbf{w}) = \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q$$

where $q=2$ recovers the quadratic regularizer. The case of $q=1$ is known as the lasso in the statistics literature. It has the property that, if $\lambda$ is sufficiently large, some of the parameters $w_j$ are driven to zero, leading to sparse models in which the corresponding basis functions play no role.
Regularization allows complex models (having a large number of parameters) to be trained on data sets of limited size, avoiding over-fitting. Unfortunately, the problem of determining the optimal model is then shifted from finding the appropriate number of basis functions to determining a suitable value for the regularization coefficient λ.
The introduction of regularization terms can control over-fitting for models having many parameters. However, the question of how to determine a suitable value for the regularization coefficient λ remains. Seeking the solution that minimizes the regularized error function with respect to both the weight vector and the regularization coefficient λ is not the right approach since it leads to the unregularized solution λ=0.
The optimal prediction for the squared loss function is given by the conditional expectation
$$h(\mathbf{x}) = \mathbb{E}[t|\mathbf{x}] = \int t\, p(t|\mathbf{x})\, dt$$

Moreover, the expected squared loss is given by

$$\mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2 p(\mathbf{x})\, d\mathbf{x} + \int\int \{h(\mathbf{x}) - t\}^2 p(\mathbf{x},t)\, d\mathbf{x}\, dt$$

The second term is independent of $y(\mathbf{x})$ and arises from the intrinsic noise on the data; it represents the minimum achievable value of the expected loss. Our goal is therefore to find a function $y(\mathbf{x})$ that makes the first term a minimum, ideally zero (which would require $y(\mathbf{x}) = h(\mathbf{x})$), since the first term is non-negative.
A frequentist treatment of the problem involves making an estimate of $\mathbf{w}$ based on a data set $\mathcal{D}$. In order to interpret the uncertainty of this estimate, consider the following thought experiment. Suppose we had a large number of data sets, each of size $N$, drawn independently from $p(t,\mathbf{x})$. Then, considering each data set in turn, we can obtain a prediction function $y(\mathbf{x};\mathcal{D})$. As expected, each data set gives a different function and consequently a different value for the squared loss. The performance of a particular learning algorithm can then be assessed by taking the average over this ensemble of data sets.

Now consider the integrand of the first term, for a particular data set,

$$\{y(\mathbf{x};\mathcal{D}) - h(\mathbf{x})\}^2$$

If we add and subtract the average of $y(\mathbf{x};\mathcal{D})$ over the ensemble of data sets, denoted $\mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})]$, we obtain
$$\{y(\mathbf{x};\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})] + \mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})] - h(\mathbf{x})\}^2 = \{y(\mathbf{x};\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})]\}^2 + \{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})] - h(\mathbf{x})\}^2 + 2\{y(\mathbf{x};\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})]\}\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})] - h(\mathbf{x})\}$$

By taking the expectation over all data sets $\mathcal{D}$ on both sides of the equation, we obtain

$$\mathbb{E}_{\mathcal{D}}\left[\{y(\mathbf{x};\mathcal{D}) - h(\mathbf{x})\}^2\right] = \{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})] - h(\mathbf{x})\}^2 + \mathbb{E}_{\mathcal{D}}\left[\{y(\mathbf{x};\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})]\}^2\right]$$

The cross term vanishes because $\mathbb{E}_{\mathcal{D}}\left[y(\mathbf{x};\mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})]\right] = 0$, while the factor $\mathbb{E}_{\mathcal{D}}[y(\mathbf{x};\mathcal{D})] - h(\mathbf{x})$ does not depend on $\mathcal{D}$, leaving a sum of two terms for the expected squared loss. The first term, called the squared bias, represents the extent to which the average prediction over all data sets differs from the desired regression function $h(\mathbf{x})$. The second term, called the variance, measures the extent to which the solutions for individual data sets vary around their average, in other words, how sensitive the function $y(\mathbf{x};\mathcal{D})$ is to the particular choice of data set.
Substituting this decomposition back into the expected squared loss, we note that

$$\text{expected loss} = (\text{bias})^2 + \text{variance} + \text{noise}$$

Our goal is to minimize the expected loss, and thus both the bias and the variance. However, there is a trade-off between them: very flexible models have low bias and high variance, while relatively rigid models have high bias and low variance. The best model is the one that balances these two quantities.

Consider $L=100$ data sets, each containing $N=25$ data points, generated independently from the sinusoidal curve $h(x)=\sin(2\pi x)$. For each data set $\mathcal{D}^{(l)}$, a model using 24 Gaussian basis functions is fitted by minimizing the regularized error function, giving a prediction function $y^{(l)}(x)$.
L = 100 # number of datasets
N = 25 # number of points per dataset
# the optimal regression function is the sinusoidal
def h(x):
    return np.sin(2 * np.pi * x)
gaussian_basis = GaussianFeature(np.linspace(0, 1, 24), 0.1)
# create a test set
x_test = np.linspace(0, 1, 1000)
x_test_features = gaussian_basis.transform(x_test)
y_test = h(x_test)
# create L datasets
datasets = []
for i in range(L):
    x_train, y_train = generate_toy_data(h, N, 0.3)
    x_train_features = gaussian_basis.transform(x_train)
    datasets.append((x_train_features, y_train))
# apply ridge regression on the L datasets
for ln_lambda in [2.4, -2.4, -8]:
    predictions = []
    plt.figure(figsize=(20, 5))
    plt.subplot(1, 2, 1)
    for i, (x_train, y_train) in enumerate(datasets):
        model = RidgeRegression(alpha=math.exp(ln_lambda))
        model.fit(x_train, y_train)
        y, _ = model.predict(x_test_features)
        predictions.append(y)
        if i == 0:
            plt.plot(x_test, y, color="red", label=rf"$\ln\lambda={ln_lambda}$")
            plt.ylim(-1.5, 1.5)
        elif i < 20:
            plt.plot(x_test, y, color="red")
    plt.legend(fontsize=14)
    plt.subplot(1, 2, 2)
    plt.plot(x_test, y_test)
    plt.plot(x_test, np.asarray(predictions).mean(axis=0))
    plt.ylim(-1.5, 1.5)
    plt.show()
The figure for the largest value of the regularization coefficient $\lambda$ shows low variance (the individual fits in the left panel look very similar) but high bias, whereas the figure for the smallest $\lambda$ shows large variance but low bias.

Note that the result of averaging many solutions for a complex model ($M=25$) is very good, suggesting that averaging may be a beneficial procedure. Indeed, a weighted average of multiple solutions lies at the core of the Bayesian approach, although the averaging is done with respect to the posterior distribution of the parameters, not with respect to multiple data sets.
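The squared bias and variance can also be estimated numerically from such an ensemble of fits. Below is a self-contained sketch that mirrors the experiment above using plain NumPy ridge solutions (the Gaussian features, noise level, and $\ln\lambda=-2.4$ are the same illustrative choices as above; this re-implementation is only a sketch, not the prml code):

import numpy as np

rng = np.random.default_rng(0)
L, N = 100, 25
x_test = np.linspace(0, 1, 200)
h_test = np.sin(2 * np.pi * x_test)

# Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
mu, s = np.linspace(0, 1, 24), 0.1
def design(x):
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s**2))

lam = np.exp(-2.4)  # regularization coefficient
predictions = []
for _ in range(L):
    x_train = rng.uniform(0, 1, N)
    t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=N)
    Phi = design(x_train)
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t_train)
    predictions.append(design(x_test) @ w)

predictions = np.asarray(predictions)           # (L, len(x_test))
y_bar = predictions.mean(axis=0)                # average prediction over data sets
bias_sq = np.mean((y_bar - h_test) ** 2)        # (bias)^2, averaged over x
variance = np.mean((predictions - y_bar) ** 2)  # variance, averaged over x
print(bias_sq, variance, bias_sq + variance)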
In order to tackle the over-fitting of maximum likelihood, we turn to the Bayesian treatment of linear regression, which leads to automatic methods for determining model complexity using the training data alone.

We begin our discussion of the Bayesian treatment of linear regression by introducing a prior probability distribution over the model parameters $\mathbf{w}$. Assume, for the moment, that the noise precision parameter $\beta$ is a known constant. Note that the likelihood function $p(\mathbf{t}|\mathbf{X},\mathbf{w})$ is the exponential of a quadratic function of $\mathbf{w}$. Thus, the corresponding conjugate prior is given by a Gaussian distribution of the form

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0,\mathbf{S}_0)$$

The posterior distribution is proportional to the product of the likelihood and the prior. Due to the choice of a conjugate Gaussian prior distribution, the posterior is also Gaussian. Thus, to derive the form of the posterior, we focus on the exponential term
$$\begin{aligned}
\text{exponential term} &= -\frac{\beta}{2}\sum_{n=1}^{N}\left\{t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 - \frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0) \\
&= -\frac{\beta}{2}\sum_{n=1}^{N}\left\{t_n^2 - 2t_n\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) + \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T\mathbf{w}\right\} - \frac{1}{2}\left(\mathbf{w}^T\mathbf{S}_0^{-1}\mathbf{w} - 2\mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{w} + \mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0\right) \\
&= -\frac{1}{2}\mathbf{w}^T\left(\sum_{n=1}^{N}\beta\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T + \mathbf{S}_0^{-1}\right)\mathbf{w} + \mathbf{w}^T\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \sum_{n=1}^{N}\beta t_n\boldsymbol{\phi}(\mathbf{x}_n)\right) - \frac{1}{2}\left(\sum_{n=1}^{N}\beta t_n^2 + \mathbf{m}_0^T\mathbf{S}_0^{-1}\mathbf{m}_0\right)
\end{aligned}$$

*Completing the square* is a common operation for Gaussian distributions: given a quadratic form defining the exponent of a Gaussian distribution, we seek to determine the corresponding mean and covariance. Such problems can be solved easily by noting that the exponent of a general Gaussian distribution $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})$ can be written as

$$-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = -\frac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x} + \mathbf{x}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} - \frac{1}{2}\boldsymbol{\mu}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$$

Then, we can equate the matrix of coefficients in the second-order term to the inverse covariance matrix $\boldsymbol{\Sigma}^{-1}$, and the coefficient of the linear term to $\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$, from which we obtain $\boldsymbol{\mu}$.
Hence, by comparing the quadratic term to that of the standard Gaussian distribution, we obtain

$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \sum_{n=1}^{N}\beta\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$$

Then, by comparing the linear term, we obtain

$$\mathbf{S}_N^{-1}\mathbf{m}_N = \mathbf{S}_0^{-1}\mathbf{m}_0 + \sum_{n=1}^{N}\beta t_n\boldsymbol{\phi}(\mathbf{x}_n) = \mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^T\mathbf{t} \Leftrightarrow \mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^T\mathbf{t}\right)$$

Therefore, the posterior distribution is given by

$$p(\mathbf{w}|\mathbf{X},\mathbf{t}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N,\mathbf{S}_N)$$

where

$$\mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^T\mathbf{t}\right) \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$$

Consider now, for the sake of simplicity, a zero-mean isotropic Gaussian prior governed by a single precision parameter $\alpha$, so that

$$p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I})$$

Then the corresponding posterior distribution over $\mathbf{w}$ has mean and covariance given by

$$\mathbf{m}_N = \beta\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t} \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$$

For this particular choice of prior, maximization of the log of the posterior is equivalent to minimization of the regularized sum-of-squares error function with $\lambda = \alpha/\beta$.
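These two formulas translate directly into a few lines of NumPy. A minimal sketch of the batch posterior computation on a toy linear data set (the values of $\alpha$, $\beta$ and the data-generating line match the example in the next cell, but everything here is illustrative):

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0

# Toy data from the line y = 0.5x - 0.3 with Gaussian noise
x = rng.uniform(-1, 1, 20)
t = 0.5 * x - 0.3 + rng.normal(scale=0.2, size=x.shape)

# Linear design matrix: phi(x) = (1, x)
Phi = np.column_stack([np.ones_like(x), x])
M = Phi.shape[1]

# Posterior over w for the zero-mean isotropic prior:
# S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta * S_N Phi^T t
S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t
print(m_N)  # should be close to (-0.3, 0.5)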
alpha = 2.0
beta = 25.0
N = 20
f = lambda x: 0.5 * x - 0.3
x_train, y_train = generate_toy_data(f, N, 0.2, (-1, 1))
x_train_linear = PolynomialFeature(degree=1).transform(x_train)
w0, w1 = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
w = np.array([np.ravel(w0), np.ravel(w1)])
# Prior for the parameters (assuming zero-mean isotropic Gaussian)
prior_posterior = MultivariateGaussian(mu=np.zeros((2, 1)), cov=(1 / alpha) * np.eye(2))
for i, x in enumerate(x_train_linear):
    # compute the posterior according to (3.50) and (3.51)
    if i > 0:
        prev_precision = np.linalg.inv(prior_posterior.cov)
        cur_precision = prev_precision + beta * x_train_linear.T @ x_train_linear
        cur_cov = np.linalg.inv(cur_precision)
        cur_mean = cur_cov @ ((prev_precision @ prior_posterior.mu).T + beta * x_train_linear.T @ y_train).reshape(-1, 1)
        prior_posterior = MultivariateGaussian(mu=cur_mean, cov=cur_cov)
    # create figures only for the three starting iterations and the last one
    if i < 3 or i == N - 1:
        plt.figure(figsize=(20, 5))
        # likelihood
        plt.subplot(1, 3, 1)
        if i == 0:
            plt.title("Likelihood", fontsize=14)
        else:
            likelihood = Gaussian(var=1 / beta).pdf(y_train[i])
            z = likelihood.pdf(mu=w.T @ x).reshape(w0.shape)
            plt.contourf(w0, w1, z, cmap="rainbow")
        plt.scatter(-0.3, 0.5, s=200, marker="x", color="white")  # optimal parameter vector
        plt.xlabel("$w_0$", fontsize=14)
        plt.ylabel("$w_1$", fontsize=14)
        # prior/posterior
        plt.subplot(1, 3, 2)
        plt.title("Prior/Posterior", fontsize=14)
        z = np.diag(prior_posterior.pdf(w)).reshape(w0.shape)
        plt.contourf(w0, w1, z, cmap="rainbow")
        plt.scatter(-0.3, 0.5, s=200, marker="x", color="white")  # optimal parameter vector
        plt.xlabel("$w_0$", fontsize=14)
        plt.ylabel("$w_1$", fontsize=14)
        # data space
        plt.subplot(1, 3, 3)
        if i == 0:
            plt.title("Data space", fontsize=14)
        else:
            plt.scatter(x_train[:i], y_train[:i], s=100, color="steelblue")
            w_sample = prior_posterior.draw(6)
            y_sample = x_train_linear @ w_sample.T
            plt.plot(x_train, y_sample, color="red")
        plt.xlim(-1, 1)
        plt.ylim(-1, 1)
        plt.xlabel("$x$", fontsize=14)
        plt.ylabel("$y$", fontsize=14)
        plt.show()
In practice, however, our goal is to make predictions of $t$ for unseen values $\mathbf{x}_{\text{unseen}}$ of the input, and thus we are not actually interested in the value of $\mathbf{w}$ itself. To that end, we evaluate the predictive distribution given by

$$p(t|\mathbf{x}_{\text{unseen}},\mathbf{t},\mathbf{X},\alpha,\beta) = \int p(t|\mathbf{x}_{\text{unseen}},\mathbf{w},\beta)\, p(\mathbf{w}|\mathbf{t},\mathbf{X},\alpha,\beta)\, d\mathbf{w}$$

Note that the predictive distribution involves the convolution of the conditional Gaussian distribution of the target variable with the Gaussian posterior over the weights,

$$p(t|\mathbf{x},\mathbf{w},\beta) = \mathcal{N}(t|y(\mathbf{x},\mathbf{w}),\beta^{-1}) = \mathcal{N}(t|\boldsymbol{\phi}(\mathbf{x})^T\mathbf{w},\beta^{-1}) \qquad p(\mathbf{w}|\mathbf{t},\mathbf{X},\alpha,\beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N,\mathbf{S}_N)$$

Taking advantage of (2.113), (2.114) and (2.115), we obtain

$$p(t|\mathbf{x}_{\text{unseen}},\mathbf{t},\mathbf{X},\alpha,\beta) = \mathcal{N}\left(t\,|\,\mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_{\text{unseen}}),\sigma_N^2(\mathbf{x}_{\text{unseen}})\right)$$

where the variance $\sigma_N^2(\mathbf{x})$ of the predictive distribution is given by

$$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})$$

The first term represents the noise on the data, whereas the second term reflects the uncertainty associated with the parameters $\mathbf{w}$.
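Given the posterior $(\mathbf{m}_N, \mathbf{S}_N)$, the predictive mean and variance follow directly from these formulas; a minimal NumPy sketch on the same toy linear data as before (again, all choices are illustrative):

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0

# Training data and linear features phi(x) = (1, x)
x = rng.uniform(-1, 1, 20)
t = 0.5 * x - 0.3 + rng.normal(scale=0.2, size=x.shape)
Phi = np.column_stack([np.ones_like(x), x])

# Posterior over w
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution at new inputs
x_new = np.linspace(-1, 1, 5)
Phi_new = np.column_stack([np.ones_like(x_new), x_new])
pred_mean = Phi_new @ m_N                                      # m_N^T phi(x)
pred_var = 1 / beta + np.sum(Phi_new @ S_N * Phi_new, axis=1)  # 1/beta + phi(x)^T S_N phi(x)
print(pred_mean, np.sqrt(pred_var))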
N = 25
sinusoidal = lambda x: np.sin(2 * np.pi * x)
x_train, y_train = generate_toy_data(sinusoidal, N, 0.25)
x_test = np.linspace(0, 1, 100)
y_test = sinusoidal(x_test)
feature = GaussianFeature(np.linspace(0, 1, 9), 0.1)
X_train = feature.transform(x_train)
X_test = feature.transform(x_test)
fig_idx = 1
plt.figure(figsize=(20, 6))
for i, x in enumerate(X_train):
    if i < 4 or i == 7 or i == N - 1:
        model = BayesianRegression(alpha=1e-3, beta=2.0)
        model.fit(X_train[: i + 1], y_train[: i + 1])
        y, y_std = model.predict(X_test)
        plt.subplot(2, 6, fig_idx)
        fig_idx += 1
        plt.scatter(x_train[: i + 1], y_train[: i + 1], s=100, facecolor="none", edgecolor="steelblue")
        plt.plot(x_test, y_test)
        plt.plot(x_test, y)
        plt.fill_between(x_test, y - y_std, y + y_std, color="coral", alpha=0.25)
        plt.xlim(0, 1)
        plt.ylim(-2, 2)
for i, x in enumerate(X_train):
    if i < 4 or i == 7 or i == N - 1:
        model = BayesianRegression(alpha=1e-3, beta=2.0)
        model.fit(X_train[: i + 1], y_train[: i + 1])
        w_samples = model.draw(6)
        plt.subplot(2, 6, fig_idx)
        fig_idx += 1
        plt.scatter(x_train[: i + 1], y_train[: i + 1], s=100, facecolor="none", edgecolor="steelblue")
        plt.plot(x_test, y_test)
        plt.plot(x_test, X_test @ w_samples.T, color="red")
        plt.xlim(0, 1)
        plt.ylim(-2, 2)
plt.show()
The posterior mean solution for the linear basis function model has an interesting interpretation. If we substitute the posterior mean solution mN, given by (3.53), into the expression (3.3), we note that the predictive mean can be expressed in the form
$$y(\mathbf{x},\mathbf{m}_N) = \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}) = \beta\boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t} = \sum_{n=1}^{N}\beta\boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_n)t_n$$

Thus, the mean of the predictive distribution at a point $\mathbf{x}$ is given by a linear combination of the training set target variables $t_n$,

$$y(\mathbf{x},\mathbf{m}_N) = \sum_{n=1}^{N}k(\mathbf{x},\mathbf{x}_n)t_n$$

where the function

$$k(\mathbf{x},\mathbf{x}') = \beta\boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}')$$

is known as the smoother matrix or the equivalent kernel. Regression functions that make predictions by taking linear combinations of the training set target values are known as linear smoothers.
Further insight into the equivalent kernel can be obtained by considering the covariance between y(x) and y(x′),
$$\text{cov}\left[y(\mathbf{x}),y(\mathbf{x}')\right] = \text{cov}\left[\boldsymbol{\phi}(\mathbf{x})^T\mathbf{w},\,\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}')\right] = \boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}') = \beta^{-1}k(\mathbf{x},\mathbf{x}')$$

Therefore, the predictive mean at nearby points is highly correlated, whereas for more distant pairs of points the correlation is smaller.

NOTE: This formulation of linear regression suggests an alternative approach to regression. Instead of introducing a set of basis functions $\phi_j$ to derive an equivalent kernel, we may instead define the kernel $k(\mathbf{x},\mathbf{x}')$ directly and use it to make predictions given an observed training set. This leads to a practical framework called Gaussian processes.
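To make the equivalent kernel concrete, the following sketch computes $k(\mathbf{x},\mathbf{x}_n) = \beta\boldsymbol{\phi}(\mathbf{x})^T\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_n)$ for Gaussian basis functions and checks that summing the kernel against the targets reproduces the predictive mean, with weights that sum to roughly one (the data, basis functions, and hyperparameters are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0

# Training inputs and Gaussian basis functions
x_train = np.linspace(0, 1, 50)
mu, s = np.linspace(0, 1, 12), 0.1
def design(x):
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s**2))

Phi = design(x_train)
t = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=x_train.shape)
S_N = np.linalg.inv(alpha * np.eye(len(mu)) + beta * Phi.T @ Phi)

# Equivalent kernel k(x, x_n) = beta * phi(x)^T S_N phi(x_n), evaluated at a single point x
x0 = np.array([0.5])
k = beta * design(x0) @ S_N @ Phi.T   # shape (1, N), localized around x = 0.5

print(k.sum())  # the kernel weights sum to roughly one
print(design(x0) @ (beta * S_N @ Phi.T @ t), (k * t).sum())  # same predictive mean either way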
In a fully Bayesian treatment of the linear basis function model, one would introduce prior distributions over the hyperparameters $\alpha$ and $\beta$ and make predictions by marginalizing over these hyperparameters in addition to the model parameters $\mathbf{w}$. However, the complete marginalization over all of these variables is analytically intractable. A useful approximation is to set the hyperparameters to specific values obtained by maximizing the marginal likelihood, which results from integrating over the model parameters $\mathbf{w}$. This framework is known in statistics as empirical Bayes, type-2 maximum likelihood, or generalized maximum likelihood. In machine learning it is also called the evidence approximation.
The predictive distribution is obtained by marginalizing over w, α and β so that
$$p(t|\tilde{\mathbf{x}},\mathbf{t},\mathbf{X}) = \int\int\int p(t|\mathbf{w},\tilde{\mathbf{x}},\beta)\, p(\mathbf{w}|\mathbf{t},\mathbf{X},\alpha,\beta)\, p(\alpha,\beta|\mathbf{t},\mathbf{X})\, d\mathbf{w}\, d\alpha\, d\beta$$

Assuming that the posterior distribution $p(\alpha,\beta|\mathbf{t},\mathbf{X})$ is sharply peaked around some values $\hat{\alpha}$ and $\hat{\beta}$, the predictive distribution is obtained by fixing $\alpha$ and $\beta$ to these values and marginalizing over $\mathbf{w}$,

$$p(t|\tilde{\mathbf{x}},\mathbf{t},\mathbf{X}) \approx p(t|\tilde{\mathbf{x}},\mathbf{t},\mathbf{X},\hat{\alpha},\hat{\beta}) = \int p(t|\tilde{\mathbf{x}},\mathbf{w},\hat{\beta})\, p(\mathbf{w}|\mathbf{t},\mathbf{X},\hat{\alpha},\hat{\beta})\, d\mathbf{w}$$

in which case we arrive at the simpler predictive distribution defined by (3.57).
The posterior distribution for α and β is given by
$$p(\alpha,\beta|\mathbf{t},\mathbf{X}) \propto p(\mathbf{t}|\mathbf{X},\alpha,\beta)\, p(\alpha,\beta)$$

Further assume that the prior is relatively flat, that is, our prior belief is that different values of $\alpha$ and $\beta$ are roughly equiprobable. Then the sharp peak of the posterior ($\hat{\alpha}$ and $\hat{\beta}$) is found by maximizing the marginal likelihood function $p(\mathbf{t}|\mathbf{X},\alpha,\beta)$.
The marginal likelihood function p(t|X,α,β) is obtained by integrating over the model parameters w, so that,
$$p(\mathbf{t}|\mathbf{X},\alpha,\beta) = \int p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta)\, p(\mathbf{w}|\alpha)\, d\mathbf{w}$$

From (3.10) we have that

$$p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^{N}\mathcal{N}(t_n|\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n),\beta^{-1}) = \prod_{n=1}^{N}\frac{1}{(2\pi\beta^{-1})^{1/2}}\exp\left\{-\frac{1}{2\beta^{-1}}\left(t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2\right\} = \left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left\{-\frac{\beta}{2}\left\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\right\|^2\right\}$$

From (3.52) we have that

$$p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I}) = \left(\frac{\alpha}{2\pi}\right)^{M/2}\exp\left\{-\frac{\alpha}{2}\left\|\mathbf{w}\right\|^2\right\}$$

Substituting both of these quantities back into the integral of the marginal likelihood, we have

$$p(\mathbf{t}|\mathbf{X},\alpha,\beta) = \left(\frac{\beta}{2\pi}\right)^{N/2}\left(\frac{\alpha}{2\pi}\right)^{M/2}\int\exp\left\{-\frac{\beta}{2}\left\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{w}\right\|^2 - \frac{\alpha}{2}\left\|\mathbf{w}\right\|^2\right\}d\mathbf{w}$$

Using (3.25) and (3.26) we obtain

$$p(\mathbf{t}|\mathbf{X},\alpha,\beta) = \left(\frac{\beta}{2\pi}\right)^{N/2}\left(\frac{\alpha}{2\pi}\right)^{M/2}\int\exp\left\{-E(\mathbf{w})\right\}d\mathbf{w}$$

where

$$E(\mathbf{w}) = \beta E_D(\mathbf{w}) + \alpha E_W(\mathbf{w})$$

Completing the square over $\mathbf{w}$, we obtain

$$E(\mathbf{w}) = \frac{\beta}{2}\left\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{w}\right\|^2 + \frac{\alpha}{2}\left\|\mathbf{w}\right\|^2 = \frac{\beta}{2}\left(\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} + \mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w}\right) + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} = \frac{1}{2}\left(\mathbf{w}^T\left(\alpha\mathbf{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)\mathbf{w} - 2\beta\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} + \beta\mathbf{t}^T\mathbf{t}\right)$$

where
$$\mathbf{A} = \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi} \qquad \mathbf{m}_N^T\mathbf{S}_N^{-1} = \beta\mathbf{t}^T\boldsymbol{\Phi} \Leftrightarrow \mathbf{m}_N^T = \beta\mathbf{t}^T\boldsymbol{\Phi}\mathbf{S}_N \Leftrightarrow \mathbf{m}_N = \beta\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t} = \beta\mathbf{A}^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

By exploiting the fact that $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$ and adding $0 = \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N - \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N$, we can further derive that

$$\begin{aligned}
E(\mathbf{w}) &= \frac{1}{2}\left(\mathbf{w}^T\left(\alpha\mathbf{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)\mathbf{w} - 2\beta\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} + \beta\mathbf{t}^T\mathbf{t}\right) \\
&= \frac{1}{2}\left(\mathbf{w}^T\mathbf{A}\mathbf{w} - 2\beta\mathbf{t}^T\boldsymbol{\Phi}\mathbf{A}^{-1}\mathbf{A}\mathbf{w} + \beta\mathbf{t}^T\mathbf{t}\right) \\
&= \frac{1}{2}\left(\mathbf{w}^T\mathbf{A}\mathbf{w} - 2\mathbf{m}_N^T\mathbf{A}\mathbf{w} + \beta\mathbf{t}^T\mathbf{t} + \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N - \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right) \\
&= \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right) + \frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N)
\end{aligned}$$

Note that the first term is independent of the model parameters $\mathbf{w}$, while the second term is the exponent of a Gaussian distribution over the model parameters. Therefore, the integral over $\mathbf{w}$ in $p(\mathbf{t}|\mathbf{X},\alpha,\beta)$ is given by
$$p(\mathbf{t}|\mathbf{X},\alpha,\beta) = \left(\frac{\beta}{2\pi}\right)^{N/2}\left(\frac{\alpha}{2\pi}\right)^{M/2}\exp\left\{-\frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right)\right\}\int\exp\left\{-\frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N)\right\}d\mathbf{w}$$

However, based on the normalization of a multivariate Gaussian distribution with covariance $\mathbf{A}^{-1}$, we know that

$$\int\frac{|\mathbf{A}|^{1/2}}{(2\pi)^{M/2}}\exp\left\{-\frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N)\right\}d\mathbf{w} = 1 \Leftrightarrow \int\exp\left\{-\frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N)\right\}d\mathbf{w} = (2\pi)^{M/2}|\mathbf{A}|^{-1/2}$$

Thus,

$$p(\mathbf{t}|\mathbf{X},\alpha,\beta) = \left(\frac{\beta}{2\pi}\right)^{N/2}\left(\frac{\alpha}{2\pi}\right)^{M/2}\exp\left\{-\frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right)\right\}(2\pi)^{M/2}|\mathbf{A}|^{-1/2}$$

and the log of the marginal likelihood is given by

$$\ln p(\mathbf{t}|\mathbf{X},\alpha,\beta) = \frac{N}{2}\ln\left(\frac{\beta}{2\pi}\right) + \frac{M}{2}\ln\left(\frac{\alpha}{2\pi}\right) - \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right) + \frac{M}{2}\ln(2\pi) - \frac{1}{2}\ln|\mathbf{A}| = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln(2\pi)$$

As a final step, we show that $\frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right) = E(\mathbf{m}_N)$:
$$\begin{aligned}
\frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right)
&= \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - 2\mathbf{m}_N^T\mathbf{A}\mathbf{m}_N + \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right) \\
&\overset{(3.84)}{=} \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - 2\beta\mathbf{t}^T\boldsymbol{\Phi}\mathbf{m}_N + \mathbf{m}_N^T\mathbf{A}\mathbf{m}_N\right) \\
&\overset{(3.81)}{=} \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - 2\beta\mathbf{t}^T\boldsymbol{\Phi}\mathbf{m}_N + \beta\mathbf{m}_N^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{m}_N\right) + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N \\
&= \frac{\beta}{2}\left\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\right\|^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N = E(\mathbf{m}_N)
\end{aligned}$$

Let us now consider the maximization of $\ln p(\mathbf{t}|\mathbf{X},\alpha,\beta)$ with respect to $\alpha$,
$$\frac{d}{d\alpha}\ln p(\mathbf{t}|\mathbf{X},\alpha,\beta) = \frac{d}{d\alpha}\frac{M}{2}\ln\alpha - \frac{d}{d\alpha}E(\mathbf{m}_N) - \frac{d}{d\alpha}\frac{1}{2}\ln|\mathbf{A}| = \frac{M}{2\alpha} - \frac{1}{2}\mathbf{m}_N^T\mathbf{m}_N - \frac{1}{2}\frac{d}{d\alpha}\ln|\mathbf{A}|$$

In order to find the derivative of $\ln|\mathbf{A}|$, we use the eigenvector equation of (C.29),

$$\left(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)\mathbf{u}_i = \lambda_i\mathbf{u}_i \overset{(C.30)}{\Leftrightarrow} \left|\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi} - \lambda_i\mathbf{I}\right| = 0 \overset{(3.81)}{\Leftrightarrow} \left|\mathbf{A} - \alpha\mathbf{I} - \lambda_i\mathbf{I}\right| = 0 \Leftrightarrow \left|\mathbf{A} - (\lambda_i+\alpha)\mathbf{I}\right| = 0$$

Therefore, $\mathbf{A}$ has eigenvalues $\lambda_i + \alpha$, and the derivative of the term involving $\ln|\mathbf{A}|$ is given by

$$\frac{d}{d\alpha}\ln|\mathbf{A}| \overset{(C.47)}{=} \frac{d}{d\alpha}\ln\prod_{i=1}^{M}(\lambda_i+\alpha) = \frac{d}{d\alpha}\sum_{i=1}^{M}\ln(\lambda_i+\alpha) = \sum_{i=1}^{M}\frac{1}{\lambda_i+\alpha}$$

Thus, the resulting solution for $\alpha$, as presented in eq. (3.92), is given by

$$\begin{aligned}
0 &= \frac{M}{2\alpha} - \frac{1}{2}\mathbf{m}_N^T\mathbf{m}_N - \frac{1}{2}\sum_{i=1}^{M}\frac{1}{\lambda_i+\alpha} \\
\Leftrightarrow \frac{1}{2}\mathbf{m}_N^T\mathbf{m}_N &= \frac{M}{2\alpha} - \frac{1}{2}\sum_{i=1}^{M}\frac{1}{\lambda_i+\alpha} \qquad (\times\, 2\alpha) \\
\Leftrightarrow \alpha\,\mathbf{m}_N^T\mathbf{m}_N &= M - \alpha\sum_{i=1}^{M}\frac{1}{\lambda_i+\alpha} = \sum_{i=1}^{M}\frac{\lambda_i+\alpha}{\lambda_i+\alpha} - \sum_{i=1}^{M}\frac{\alpha}{\lambda_i+\alpha} = \sum_{i=1}^{M}\frac{\lambda_i}{\lambda_i+\alpha} = \gamma \\
\Leftrightarrow \alpha &= \frac{\gamma}{\mathbf{m}_N^T\mathbf{m}_N}
\end{aligned}$$

Note that this is an implicit solution for $\alpha$, since both $\gamma$ and $\mathbf{m}_N$ depend on $\alpha$. Thus, we have to use an iterative procedure: make an initial choice for $\alpha$, compute $\mathbf{m}_N$, evaluate $\gamma$ (3.91), and re-estimate $\alpha$ (3.92), repeating until convergence. It should be emphasized that the value of $\alpha$ is determined purely from the training data; no independent data set is required in order to optimize model complexity.
To maximize $\ln p(\mathbf{t}|\mathbf{X},\alpha,\beta)$ with respect to $\beta$, we proceed similarly,

$$\frac{d}{d\beta}\ln p(\mathbf{t}|\mathbf{X},\alpha,\beta) = \frac{d}{d\beta}\frac{N}{2}\ln\beta - \frac{d}{d\beta}E(\mathbf{m}_N) - \frac{d}{d\beta}\frac{1}{2}\ln|\mathbf{A}| = \frac{N}{2\beta} - \frac{d}{d\beta}E(\mathbf{m}_N) - \frac{1}{2}\frac{d}{d\beta}\ln|\mathbf{A}|$$

Let us take a closer look at the second term,

$$\begin{aligned}
\frac{d}{d\beta}E(\mathbf{m}_N) &= \frac{d}{d\beta}\frac{\beta}{2}\left\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\right\|^2 + \frac{d}{d\beta}\frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N \\
&= \frac{1}{2}\left\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\right\|^2 + \left(\frac{\beta}{2}\frac{d}{d\mathbf{m}_N}\left\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\right\|^2 + \frac{\alpha}{2}\frac{d}{d\mathbf{m}_N}\mathbf{m}_N^T\mathbf{m}_N\right)^T\frac{d\mathbf{m}_N}{d\beta} \\
&= \frac{1}{2}\left\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\right\|^2 + \left(-\beta\boldsymbol{\Phi}^T(\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N) + \alpha\mathbf{m}_N\right)^T\frac{d\mathbf{m}_N}{d\beta} \\
&= \frac{1}{2}\left\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\right\|^2 + \left(-\beta\boldsymbol{\Phi}^T\mathbf{t} + \left(\alpha\mathbf{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)\mathbf{m}_N\right)^T\frac{d\mathbf{m}_N}{d\beta} \\
&\overset{(3.81)}{=} \frac{1}{2}\left\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\right\|^2 + \left(-\beta\boldsymbol{\Phi}^T\mathbf{t} + \mathbf{A}\mathbf{m}_N\right)^T\frac{d\mathbf{m}_N}{d\beta} \\
&\overset{(3.84)}{=} \frac{1}{2}\left\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\right\|^2 = \frac{1}{2}\sum_{n=1}^{N}\left\{t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2
\end{aligned}$$

The last term, involving the derivative of $\ln|\mathbf{A}|$, becomes

$$\frac{d}{d\beta}\ln|\mathbf{A}| = \frac{d}{d\beta}\sum_{i=1}^{M}\ln(\lambda_i+\alpha) = \sum_{i=1}^{M}\frac{1}{\lambda_i+\alpha}\frac{d\lambda_i}{d\beta} = \frac{1}{\beta}\sum_{i=1}^{M}\frac{\lambda_i}{\lambda_i+\alpha} = \frac{\gamma}{\beta}$$

Finally, combining all these expressions together, we obtain

$$0 = \frac{N}{2\beta} - \frac{1}{2}\sum_{n=1}^{N}\left\{t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 - \frac{\gamma}{2\beta} \Leftrightarrow \frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^{N}\left\{t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2$$

As expected, the solution for $\beta$ is also implicit. If both $\alpha$ and $\beta$ are to be determined, their values can be re-estimated together after each update of $\gamma$.
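These two implicit equations suggest a simple fixed-point scheme: alternate between computing the posterior for the current $\alpha$, $\beta$ and re-estimating $\alpha = \gamma/(\mathbf{m}_N^T\mathbf{m}_N)$ and $\beta^{-1} = \sum_n\{t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\}^2/(N-\gamma)$. Below is a minimal NumPy sketch of that loop (initial values, data, and features are illustrative; the prml `EvidenceApproximation` model used in the next cell presumably performs a similar iteration internally):

import numpy as np

rng = np.random.default_rng(0)

# Toy data and polynomial features (illustrative choices)
x = np.linspace(-5, 5, 30)
t = x * (x - 5) * (x + 5) + rng.normal(scale=10, size=x.shape)
Phi = np.vander(x, 4, increasing=True)  # cubic polynomial basis
N, M = Phi.shape

alpha, beta = 1.0, 1.0  # arbitrary starting values
eigvals_unit = np.linalg.eigvalsh(Phi.T @ Phi)  # eigenvalues of Phi^T Phi

for _ in range(100):
    # Posterior for the current alpha, beta
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    # gamma = sum_i lambda_i / (lambda_i + alpha), with lambda_i eigenvalues of beta * Phi^T Phi
    lam = beta * eigvals_unit
    gamma = np.sum(lam / (lam + alpha))
    # Re-estimate alpha and beta
    alpha = gamma / (m_N @ m_N)
    beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)

print(alpha, beta, gamma)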
cubic = lambda x: x * (x - 5) * (x + 5)
x_train, y_train = generate_toy_data(cubic, 30, 10, [-5, 5])
x_test = np.linspace(-5, 5, 100)
models = []
evidences = []
for i in range(8):
    feature = PolynomialFeature(degree=i)
    X_train = feature.transform(x_train)
    model = EvidenceApproximation(alpha=100.0, beta=100.0)
    model.fit(X_train, y_train, n_iter=100)
    evidences.append(model.log_evidence(X_train, y_train))
    models.append(model)
# select the best performing degree (the one having the highest evidence)
degree = np.nanargmax(evidences)
regression = models[degree]
X_test = PolynomialFeature(degree=int(degree)).transform(x_test)
y, y_std = regression.predict(X_test)
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.scatter(x_train, y_train, s=50, facecolor="none", edgecolor="steelblue", label="observations")
plt.plot(x_test, cubic(x_test), label="x(x-5)(x+5)")
plt.plot(x_test, y, label=f"prediction (degree={degree})")
plt.fill_between(x_test, y - y_std, y + y_std, alpha=0.5, label="standard deviation", color="orange")
plt.xlabel("x", fontsize=12)
plt.ylabel("y", fontsize=12)
plt.legend(fontsize=12)
plt.subplot(1, 2, 2)
plt.plot(evidences)
plt.title("Model evidence", fontsize=12)
plt.xlabel("Degree", fontsize=12)
plt.ylabel("Log evidence", fontsize=12)
plt.show()
The figure on the left presents the fitted function estimated by the evidence approximation for the selected degree. The figure on the right depicts the log evidence for each polynomial degree. Note that the highest evidence is achieved for a polynomial of degree 3, matching the cubic function that generated the data.