import math
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from prml.preprocessing import LinearFeature
from prml.linear import LeastSquaresClassifier
from prml.linear import FisherLinearDiscriminant
from prml.linear import Perceptron
from prml.linear import GenerativeClassifier
from prml.linear import LogisticRegression
from prml.linear import SoftmaxRegression
from prml.linear import BayesianLogisticRegression
from prml.distribution import Gaussian
# Set random seed to make deterministic
np.random.seed(0)
# Ignore zero divisions and computation involving NaN values.
np.seterr(divide="ignore", invalid="ignore")
# Enable higher resolution plots
%config InlineBackend.figure_format = 'retina'
# Enable autoreload all modules before executing code
%load_ext autoreload
%autoreload 2
The goal in classification is to take an input vector x and assign it to one of K discrete classes Ck, where k = 1, …, K. The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces. Linear models define decision surfaces as linear functions of the input vector x and hence are defined by (D−1)-dimensional hyperplanes inside the D-dimensional input space. Datasets whose classes can be separated exactly by linear decision surfaces are called linearly separable.
There are three distinct approaches to the classification problem:

- Discriminant functions, which directly assign each input vector x to a class.
- Probabilistic discriminative models, which model the posterior distributions p(Ck|x) directly.
- Probabilistic generative models, which model the class-conditional densities p(x|Ck) together with the class priors p(Ck), and then compute the posteriors via Bayes' theorem.
In the linear regression models, the model prediction y(x,w) was given by a linear function of the parameters w. For classification problems, however, we wish to predict discrete class labels. To that end, we consider a generalization of the above model in which we transform the linear function using a nonlinear function f(⋅) so that
$$y(\mathbf{x}) = f(\mathbf{w}^T\mathbf{x} + w_0)$$

In machine learning, the function f is known as an activation function.
A discriminant is a function that assigns one of K classes to an input vector x. Linear discriminants define decision surfaces that are hyperplanes.
The simplest linear discriminant function is obtained by taking a linear function of the input vector so that,
$$y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$$

where w is a weight vector and w0 is a bias (the negative of the bias is also called threshold). Then, an input x is assigned to class C1 if y(x) ≥ 0 and to class C2 otherwise. Thus, the decision boundary is defined by y(x) = 0.
Consider two points xA and xB lying on the decision surface. Then, $y(\mathbf{x}_A) = y(\mathbf{x}_B) = 0 \Leftrightarrow \mathbf{w}^T(\mathbf{x}_A - \mathbf{x}_B) = 0$, which implies that the vector w is orthogonal to every vector lying in the decision surface, as depicted below:
Note that for more than two classes (K > 2), a single K-class discriminant comprising K linear functions can be used in order to avoid the ambiguously classified regions that arise when combining multiple two-class classifiers (such as one-versus-the-rest). The linear function for each class takes the form $y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$, and a point x is assigned to class Ck if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$.
In the following sections we explore three approaches to learning the parameters of linear discriminant functions: least squares, Fisher's linear discriminant, and the perceptron algorithm.
In Chapter 3, minimizing the sum-of-squares error function led to a closed-form solution for the parameter values. Can we apply the same principle to classification problems?
Consider a general classification problem having K classes, using a 1-of-K binary coding scheme or one-hot encoding for the target vector. Each class Ck is described by its own linear model yk. We can group these models together using vector notation so that
$$y(\mathbf{x}) = \widetilde{\mathbf{W}}^T\widetilde{\mathbf{x}}$$

where $\widetilde{\mathbf{W}}$ is a matrix whose k-th column comprises the (D+1)-dimensional vector $\widetilde{\mathbf{w}}_k = (w_{k0}, \mathbf{w}_k^T)^T$ and $\widetilde{\mathbf{x}}$ is the augmented input vector $(1, \mathbf{x}^T)^T$. The parameter matrix $\widetilde{\mathbf{W}}$ is determined by minimizing the sum-of-squares error function, as presented in Chapter 3. Thus, the solution for $\widetilde{\mathbf{W}}$ is obtained from

$$\widetilde{\mathbf{W}} = (\widetilde{\mathbf{X}}^T\widetilde{\mathbf{X}})^{-1}\widetilde{\mathbf{X}}^T\mathbf{T}$$

# number of training points
N = 100
x_train, t = make_classification(
n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, n_samples=N
)
x1, x2 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
x_test = np.array([x1, x2]).reshape(2, -1).T
feature = LinearFeature()
x_train_linear = feature.transform(x_train)
x_test_linear = feature.transform(x_test)
model = LeastSquaresClassifier()
model.fit(x_train_linear, t)
predicted = model.predict(x_test_linear)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("$x_1$", fontsize=12)
plt.ylabel("$x_2$", fontsize=12)
plt.show()
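For comparison, the closed-form solution above can also be computed directly with NumPy. The following is a minimal sketch (independent of the prml LeastSquaresClassifier), assuming a one-hot encoding of the targets:
# least-squares classification sketch using the normal equations
T_onehot = np.eye(2)[t]                          # one-hot encode targets (N x K)
X_tilde = np.hstack((np.ones((N, 1)), x_train))  # augmented inputs (N x (D+1))
W_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ T_onehot)  # (D+1) x K
X_test_tilde = np.hstack((np.ones((len(x_test), 1)), x_test))
labels = np.argmax(X_test_tilde @ W_tilde, axis=1)  # assign each point to the class with the largest output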
The least-squares approach gives an exact closed-form solution for the discriminant function parameters. However, even as a discriminant function (making decisions directly) it suffers from some problems. We already know that least-squares solutions lack robustness to outliers, and this applies equally to classification, as depicted in the following figure. Note that the additional outlier data points produce a significant change in the location of the decision boundary, even though these points would be correctly classified by the original decision boundary. The sum-of-squares error function penalizes predictions that are "too correct", in the sense that they lie a long way on the correct side of the decision boundary.
# number of training points
N = 100
# number of outlier points
n_outliers = 5
x_train, t = make_classification(
n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, n_samples=N, random_state=12
)
x1, x2 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
x_test = np.array([x1, x2]).reshape(2, -1).T
outliers = np.random.random_sample((n_outliers, 2)) + 3
x_train_outliers = np.vstack((x_train, outliers))
t_outliers = np.hstack((t, np.ones(n_outliers, dtype=int)))
feature = LinearFeature()
x_train_linear = feature.transform(x_train)
x_train_linear_outliers = feature.transform(x_train_outliers)
x_test_linear = feature.transform(x_test)
model = LeastSquaresClassifier()
model.fit(x_train_linear, t)
predicted = model.predict(x_test_linear)
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("$x_1$", fontsize=12)
plt.ylabel("$x_2$", fontsize=12)
plt.title("Original decision boundary")
model.fit(x_train_linear_outliers, t_outliers)
predicted_outliers = model.predict(x_test_linear)
plt.subplot(1, 2, 2)
plt.scatter(x_train_outliers[:, 0], x_train_outliers[:, 1], c=t_outliers)
plt.contourf(x1, x2, predicted_outliers.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("$x_1$", fontsize=12)
plt.ylabel("$x_2$", fontsize=12)
plt.title("Outliers decision boundary")
plt.show()
The failure of least squares should not surprise us since it corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors clearly do not have a Gaussian distribution.
Consider a two-class problem in which there are N1 points of class C1 and N2 points from class C2, so that the mean vectors of the two classes are given by
$$\mathbf{m}_1 = \frac{1}{N_1}\sum_{n \in \mathcal{C}_1}\mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2}\sum_{n \in \mathcal{C}_2}\mathbf{x}_n$$

Then, the simplest measure of separation of the classes, when projected onto w, is the separation of the projected class means. This suggests that we might choose w so as to maximize

$$m_2 - m_1 = \mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)$$

where

$$m_k = \mathbf{w}^T\mathbf{m}_k$$

is the mean of the projected data from class Ck.
The idea proposed by Fisher is to maximize the function that gives a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. The within-class variance of the projected data from class Ck is given by,
$$s_k^2 = \sum_{n \in \mathcal{C}_k}(y_n - m_k)^2$$

where $y_n = \mathbf{w}^T\mathbf{x}_n$ is the projected data point in the one-dimensional space. We can further define the total within-class variance for the whole data set to be simply $s_1^2 + s_2^2$. Then, the Fisher criterion is defined as the ratio of the between-class variance to the within-class variance as follows,

$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$$

In order to explicitly show the dependence on w, we may rewrite J(w), using (4.20), (4.23), and (4.24), as follows,

$$
\begin{aligned}
J(\mathbf{w}) &= \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}
= \frac{(\mathbf{w}^T\mathbf{m}_2 - \mathbf{w}^T\mathbf{m}_1)^2}{\sum_{n \in \mathcal{C}_1}(y_n - m_1)^2 + \sum_{n \in \mathcal{C}_2}(y_n - m_2)^2} \\
&= \frac{\left(\mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)\right)^2}{\sum_{n \in \mathcal{C}_1}\left(\mathbf{w}^T(\mathbf{x}_n - \mathbf{m}_1)\right)^2 + \sum_{n \in \mathcal{C}_2}\left(\mathbf{w}^T(\mathbf{x}_n - \mathbf{m}_2)\right)^2} \\
&= \frac{\mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T\mathbf{w}}{\mathbf{w}^T\left(\sum_{n \in \mathcal{C}_1}(\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in \mathcal{C}_2}(\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T\right)\mathbf{w}} \\
&= \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}
\end{aligned}
$$

Then, by computing the derivative with respect to w, we find that J(w) is maximized when,

$$(\mathbf{w}^T\mathbf{S}_B\mathbf{w})\mathbf{S}_W\mathbf{w} = (\mathbf{w}^T\mathbf{S}_W\mathbf{w})\mathbf{S}_B\mathbf{w} \;\Leftrightarrow\; \mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$$

This result is known as Fisher's linear discriminant. The projected data are then compared against a threshold y0 and classified as belonging to class C1 if y(x) ≥ y0, and to class C2 otherwise.
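To make the formula concrete, here is a minimal NumPy sketch (independent of the prml FisherLinearDiscriminant class) that computes the projection direction from the class means and the within-class scatter matrix:
# Fisher direction sketch: w is proportional to S_W^{-1} (m2 - m1)
def fisher_direction(x, t):
    x0, x1 = x[t == 0], x[t == 1]
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    # within-class scatter matrix
    s_w = (x0 - m0).T @ (x0 - m0) + (x1 - m1).T @ (x1 - m1)
    w = np.linalg.solve(s_w, m1 - m0)
    return w / np.linalg.norm(w)  # normalize for convenience

# example usage: w_fisher = fisher_direction(x_train, t)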
# number of training points
N = 100
x_train, t = make_classification(
n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, n_samples=N, random_state=15
)
x1, x2 = np.meshgrid(np.linspace(-5, 5, N), np.linspace(-5, 5, N))
x_test = np.array([x1, x2]).reshape(2, -1).T
plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.title("Training data (2 classes)")
plt.show()
One way of choosing the threshold y0 is to model the class-conditional densities of the projected data p(y|Ck) (one per class) as Gaussian distributions, then estimate their parameters using maximum likelihood, and finally estimate the optimal threshold using decision theory. In the case of binary classification, we can equate the two Gaussian densities and solve for the projected value. The result is a quadratic equation whose coefficients relate to the Gaussian means and variances.
# split data points according to the classes
x_0 = x_train[t == 0]
x_1 = x_train[t == 1]
model = FisherLinearDiscriminant()
model.fit(x_train, t)
# create a Gaussian distribution per class
g0 = Gaussian()
g0.ml(x_0 @ model._w)
g1 = Gaussian()
g1.ml(x_1 @ model._w)
root = np.roots(
[
g1.var - g0.var,
2 * (g0.var * g1.mu - g1.var * g0.mu),
g1.var * g0.mu**2 - g0.var * g1.mu**2 - g1.var * g0.var * np.log(g1.var / g0.var),
]
)
x = np.linspace(-5, 5, N)
plt.plot(x, g0.pdf(x), "springgreen")
plt.plot(x, g1.pdf(x), "firebrick")
plt.plot(root[0], g0.pdf(root[0]), "ko")
plt.plot(root[1], g0.pdf(root[1]), "ko")
plt.plot(np.zeros(x.size) + root[1], np.linspace(0, 1, N), "k--")
plt.title("Intersection point of Gaussian curves")
plt.show()
The figures below compare the optimal threshold against the naive zero threshold. Note that the selection of the decision threshold is crucial for an effective model.
model = FisherLinearDiscriminant()
model.fit(x_train, t)
optimal_threshold = model._threshold
plt.figure(figsize=(15, 5))
# predict classes using the optimal threshold
predicted = model.predict(x_test)
plt.subplot(1, 2, 2)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(N, N), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.title(f"Optimal threshold ({round(optimal_threshold, 2)})")
# set threshold to zero and make predictions
model._threshold = 0
predicted = model.predict(x_test)
plt.subplot(1, 2, 1)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(N, N), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.title("Zero threshold (suboptimal)")
plt.show()
Another linear discriminant model is the perceptron. It corresponds to a two-class model in which the input vector x is first transformed using a fixed nonlinear transformation to give a feature vector ϕ(x), which is then used to construct a generalized linear model of the form

$$y(\mathbf{x}) = f(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}))$$

We assume that the vector ϕ(x) typically includes a bias component ϕ0(x) = 1. The nonlinear activation function f(⋅) is given by a step function of the form

$$f(\alpha) = \begin{cases} +1, & \alpha \geq 0 \\ -1, & \alpha < 0 \end{cases}$$

That is because for the perceptron it is more convenient to use target values t = +1 for class C1 and t = −1 for class C2, instead of t ∈ {0, 1}. We consider an error function called the perceptron criterion. Note that we are seeking a weight vector w such that inputs xn belonging to class C1 have $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) > 0$, whereas those belonging to class C2 have $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) < 0$. Given the coding scheme t ∈ {−1, +1}, it follows that all inputs must satisfy $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)t_n > 0$. Thus, the perceptron criterion tries to minimize the quantity $-\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)t_n$ for all misclassified inputs. More formally,

$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)t_n$$

where M denotes the set of misclassified patterns.
We apply the stochastic gradient descent algorithm to the error function. Thus, the change in the weight vector, according to gradient descent, is given by,
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta\boldsymbol{\phi}_n t_n$$

The perceptron convergence theorem states that if there exists an exact solution (i.e., the data are linearly separable), then the perceptron algorithm is guaranteed to find it in a finite number of steps. On the other hand, if the data are not linearly separable, the perceptron never converges. Moreover, note that the perceptron does not provide probabilistic outputs, nor does it generalize to K > 2 classes.
Another important limitation arises from the fact that it is based on linear combinations of fixed basis functions.
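To illustrate the update rule above, here is a minimal stochastic perceptron training loop in NumPy (a sketch, not the prml Perceptron implementation), assuming the identity feature map with an explicit bias:
# perceptron sketch using the update w <- w + eta * phi_n * t_n for misclassified points
def perceptron_train(x, t, eta=1.0, max_epochs=100):
    phi = np.hstack((np.ones((len(x), 1)), x))  # phi(x) = (1, x)
    w = np.zeros(phi.shape[1])
    for _ in range(max_epochs):
        misclassified = 0
        for phi_n, t_n in zip(phi, t):
            if (w @ phi_n) * t_n <= 0:  # misclassified pattern
                w += eta * phi_n * t_n
                misclassified += 1
        if misclassified == 0:  # converged (only possible if the data are linearly separable)
            break
    return w

# example usage: w = perceptron_train(x_train, np.where(t == 0, -1, 1))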
# number of training points
N = 100
x1, x2 = np.meshgrid(np.linspace(-5, 5, N), np.linspace(-5, 5, N))
x_test = np.array([x1, x2]).reshape(2, -1).T
model = Perceptron()
plt.figure(figsize=(15, 5))
x_train, t = make_classification(
n_features=2,
n_informative=2,
n_redundant=0,
n_classes=2,
n_clusters_per_class=1,
n_samples=N,
random_state=10,
class_sep=2,
)
model.fit(x_train, np.where(t == 0, -1, 1))
predicted = model.predict(x_test)
plt.subplot(1, 2, 1)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(N, N), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.title("Linearly separable data")
x_train, t = make_classification(
n_features=2,
n_informative=2,
n_redundant=0,
n_classes=2,
n_clusters_per_class=1,
n_samples=N,
random_state=14,
)
model.fit(x_train, np.where(t == 0, -1, 1))
predicted = model.predict(x_test)
plt.subplot(1, 2, 2)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(N, N), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.title("Non-linearly separable data")
plt.show()
Models having linear decision boundaries arise from simple assumptions about the distribution of the data. A generative approach models the class-conditional densities p(x|Ck), as well as the class priors p(Ck), and uses them to compute the posterior probability p(Ck|x) through Bayes' theorem. To that end, the posterior probability for class C1, in a binary classification problem, is as follows,
$$p(\mathcal{C}_1|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1) + p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)} = \frac{1}{1 + \exp(-\alpha)} = \sigma(\alpha)$$

where,

$$\alpha = \ln\frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)}$$

Proof

$$
\begin{aligned}
\sigma(\alpha) &= \frac{1}{1 + \exp(-\alpha)} = \frac{1}{1 + \frac{1}{\exp(\alpha)}} = \frac{\exp(\alpha)}{1 + \exp(\alpha)}
= \frac{\exp\left(\ln\frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)}\right)}{1 + \exp\left(\ln\frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)}\right)} \\
&= \frac{\frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)}}{1 + \frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)}}
= \frac{\frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)}}{\frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1) + p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)}{p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)}}
= \frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1) + p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)}
\end{aligned}
$$

The function σ(α) is the logistic sigmoid briefly presented in Chapter 3.
For K>2 classes, the posterior for class Ck is as follows,
$$p(\mathcal{C}_k|\mathbf{x}) = \frac{p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k)}{p(\mathbf{x})} = \frac{p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k)}{\sum_i p(\mathbf{x}|\mathcal{C}_i)p(\mathcal{C}_i)} = \frac{\exp(\alpha_k)}{\sum_i \exp(\alpha_i)}$$

which is known as the normalized exponential and can be regarded as a multiclass generalization of the logistic sigmoid function. The quantities αk are defined as follows,

$$\alpha_k = \ln\left(p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k)\right)$$

The normalized exponential is also known as the softmax function, since it represents a smoothed version of the max function: if $\alpha_k \gg \alpha_i\ \forall i \neq k$, then $p(\mathcal{C}_k|\mathbf{x}) \approx 1$ and $p(\mathcal{C}_i|\mathbf{x}) \approx 0$.
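A quick numerical illustration of this "smoothed max" behaviour (subtracting the maximum before exponentiating is a standard numerical-stability trick and leaves the result unchanged):
# softmax as a smoothed version of max
def softmax(a):
    a = a - np.max(a)  # improves numerical stability without changing the result
    return np.exp(a) / np.sum(np.exp(a))

print(softmax(np.array([2.0, 1.0, 0.5])))   # moderately peaked distribution
print(softmax(np.array([20.0, 1.0, 0.5])))  # nearly one-hot: the first activation dominates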
Given the formulation above, the next step is to assume the form of the class-conditional densities. The Gaussian distributions may be used for modelling continuous variables. Assuming that all classes share the same covariance matrix, the density for class Ck is given by
$$p(\mathbf{x}|\mathcal{C}_k) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$$

Thus, from (4.58), we have,

$$
\begin{aligned}
\alpha &= \ln\frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)}
= \ln\frac{\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_1, \boldsymbol{\Sigma})p(\mathcal{C}_1)}{\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_2, \boldsymbol{\Sigma})p(\mathcal{C}_2)}
= \ln\frac{\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_1, \boldsymbol{\Sigma})}{\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_2, \boldsymbol{\Sigma})} + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)} \\
&= \ln\frac{\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)\right\}}{\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)\right\}} + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)} \\
&= -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)} \\
&= -\frac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - \boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\mathbf{x} + \frac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)} \\
&= (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}
\end{aligned}
$$

To that end, using (4.57), we derive that,

$$p(\mathcal{C}_1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0)$$

where,

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}$$

Note that the prior probabilities p(Ck) enter only through the bias parameter w0, thus making parallel shifts of the decision boundary.
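The mapping from fitted Gaussian parameters to the logistic form above is easy to sketch in NumPy; this mirrors the equations, not the prml GenerativeClassifier internals:
# posterior p(C1|x) = sigma(w^T x + w0) from a shared-covariance Gaussian model
def gaussian_generative_weights(mu1, mu2, sigma, prior1):
    sigma_inv = np.linalg.inv(sigma)
    w = sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ sigma_inv @ mu1
          + 0.5 * mu2 @ sigma_inv @ mu2
          + np.log(prior1 / (1 - prior1)))
    return w, w0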
For the general case of K classes, from (4.63), we have,
$$
\begin{aligned}
\alpha_k &= \ln\left(\frac{1}{(2\pi)^{D/2}}\right) + \ln\left(\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\right) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_k) + \ln p(\mathcal{C}_k) \\
&= \ln\left(\frac{1}{(2\pi)^{D/2}}\right) + \ln\left(\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\right) - \frac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \ln p(\mathcal{C}_k) \\
&= A + B + Q + \mathbf{w}_k^T\mathbf{x} + w_{k0}
\end{aligned}
$$

where,

$$A = \ln\left(\frac{1}{(2\pi)^{D/2}}\right), \quad B = \ln\left(\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}\right), \quad Q = -\frac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x}, \quad \mathbf{w}_k = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k, \quad w_{k0} = -\frac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \ln p(\mathcal{C}_k)$$

Then, using (4.62), we derive,

$$p(\mathcal{C}_k|\mathbf{x}) = \frac{\exp(\alpha_k)}{\sum_j \exp(\alpha_j)} = \frac{\exp(A + B + Q)\exp(\mathbf{w}_k^T\mathbf{x} + w_{k0})}{\exp(A + B + Q)\sum_j \exp(\mathbf{w}_j^T\mathbf{x} + w_{j0})}$$

and re-define αk as follows,

$$a_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$$

Therefore, we see that for K > 2 classes, the αk are linear functions of x, since the quadratic terms cancel due to the shared covariance matrix. If we relax the assumption of a shared covariance matrix and allow each class to have its own covariance matrix Σk, we instead obtain quadratic functions of x, giving rise to a quadratic discriminant.
Given a set of data comprising observations x and corresponding class labels, we can determine the parameters of the class-conditional densities and the class prior probabilities using maximum likelihood. Suppose that we are given a dataset {xn, tn}, where tn = 1 denotes class C1 and tn = 0 denotes class C2. Then, for a data point xn belonging to class C1 (tn = 1), we have,

$$p(\mathbf{x}_n, \mathcal{C}_1) = p(\mathcal{C}_1)p(\mathbf{x}_n|\mathcal{C}_1) = \pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma})$$

Similarly, for class C2 (tn = 0),

$$p(\mathbf{x}_n, \mathcal{C}_2) = p(\mathcal{C}_2)p(\mathbf{x}_n|\mathcal{C}_2) = (1 - \pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma})$$

where p(C1) = π and, complementarily, p(C2) = 1 − π.
Thus, the likelihood function is given by,
$$p(\mathbf{t}, \mathbf{X}|\pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^N \left[\pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma})\right]^{t_n}\left[(1-\pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma})\right]^{1-t_n}$$

and the log-likelihood is as follows,

$$\ln p(\mathbf{t}, \mathbf{X}|\pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \sum_{n=1}^N t_n\left(\ln\pi + \ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma})\right) + (1-t_n)\left(\ln(1-\pi) + \ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma})\right)$$

As expected, the maximum likelihood estimate for π is simply the fraction of points in class C1.
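Maximizing this log-likelihood gives closed-form estimators; a minimal NumPy sketch (prior as the class fraction, per-class means, and a weighted shared covariance):
# maximum likelihood estimates for the shared-covariance Gaussian generative model
def fit_gaussian_generative(x, t):
    n1, n2 = np.sum(t == 1), np.sum(t == 0)
    pi = n1 / (n1 + n2)                      # fraction of points in class C1
    mu1 = x[t == 1].mean(axis=0)
    mu2 = x[t == 0].mean(axis=0)
    s1 = np.cov(x[t == 1], rowvar=False, bias=True)
    s2 = np.cov(x[t == 0], rowvar=False, bias=True)
    sigma = (n1 * s1 + n2 * s2) / (n1 + n2)  # shared covariance as a weighted average
    return pi, mu1, mu2, sigma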
Note: Fitting Gaussian distributions to the classes is not robust to outliers, because the maximum likelihood estimation of a Gaussian is not robust itself.
Consider the case of discrete binary feature values $x_i \in \{0, 1\}$. When there are D inputs, a general distribution would correspond to $2^D - 1$ independent variables. Assuming a naive Bayes approach, in which the features are treated as independent conditioned on the class, we have the following class-conditional mass functions,

$$p(\mathbf{x}|\mathcal{C}_k) = \prod_{i=1}^D \mu_{ki}^{x_i}(1 - \mu_{ki})^{1 - x_i}$$

For K classes, substituting into (4.63) gives,

$$\alpha_k(\mathbf{x}) = \sum_{i=1}^D \left(x_i\ln\mu_{ki} + (1 - x_i)\ln(1 - \mu_{ki})\right) + \ln p(\mathcal{C}_k)$$

In the more general case, where discrete variables can take M > 2 states, the class-conditional mass functions are defined as follows,

$$p(\mathbf{x}|\mathcal{C}_k) = \prod_{i=1}^D\prod_{m=1}^M \mu_{kim}^{\phi(x_i)_m}$$

where ϕ(xi) produces a 1-of-M binary coding scheme, in which only one of the values ϕ(xi)1, …, ϕ(xi)M is 1 and all the others are 0. Thus, substituting the expression above into (4.63) gives,

$$\alpha_k(\mathbf{x}) = \sum_{i=1}^D\sum_{m=1}^M \phi(x_i)_m\ln\mu_{kim} + \ln p(\mathcal{C}_k)$$

# number of training points
N = 100
x_train, t = make_classification(
n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, n_samples=N, random_state=21
)
x1, x2 = np.meshgrid(np.linspace(-5, 5, N), np.linspace(-5, 5, N))
x_test = np.array([x1, x2]).reshape(2, -1).T
model = GenerativeClassifier()
model.fit(x_train, t)
predicted = model.predict(x_test)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(N, N), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.show()
An alternative approach, called discriminative training, is to directly maximize the likelihood function defined through the conditional distribution p(Ck|x).
Consider the binary classification problem. In the analysis of generative approaches we saw that under rather general assumptions, the posterior probability of class C1 can be expressed as a logistic sigmoid acting on a linear function of the input vectors x or the feature vector ϕ (see 4.65) so that,
$$p(\mathcal{C}_1|\boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T\boldsymbol{\phi})$$

In the terminology of statistics, this model is known as logistic regression, although it is a classification model.
One advantage of the discriminative approach is that there are typically fewer adaptive parameters to be determined. For an M-dimensional feature space, this model has M adjustable parameters. By contrast, the generative model using Gaussian class conditional densities, would have used 2M parameters for the means and M(M+1)/2 parameters for the (shared) covariance matrix.
We can use maximum likelihood to determine the parameters of the logistic regression model. Given a data set {ϕn,tn}, where tn∈{0,1}, the likelihood function is given by,
$$p(\mathbf{t}|\boldsymbol{\Phi}, \mathbf{w}) = \prod_{n=1}^N p(\mathcal{C}_1|\boldsymbol{\phi}_n)^{t_n}\left(1 - p(\mathcal{C}_1|\boldsymbol{\phi}_n)\right)^{1-t_n} = \prod_{n=1}^N y_n^{t_n}\{1 - y_n\}^{1-t_n}$$

Maximizing the likelihood is equivalent to minimizing the negative logarithm of the likelihood, which gives the cross-entropy error function,

$$E(\mathbf{w}) = -\ln p(\mathbf{t}|\boldsymbol{\Phi}, \mathbf{w}) = -\sum_{n=1}^N \left\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\right\}$$

Why is the error function called cross-entropy?
The cross-entropy for two discrete probability distributions p and q is defined as $H(p, q) = -\sum_x p(x)\log q(x)$. Since the target variables tn can be viewed as probabilities taking only the extreme values 0 or 1, and yn is the predicted probability of class C1, E(w) can be interpreted as the cross-entropy between the target variables and the predicted posterior distribution.
Then, taking the gradient of the error function over w, we obtain,
$$
\begin{aligned}
\nabla E(\mathbf{w}) &= -\nabla\ln p(\mathbf{t}|\boldsymbol{\Phi}, \mathbf{w})
= -\nabla\sum_{n=1}^N \left\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\right\} \\
&= -\sum_{n=1}^N \left(\frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n}\right)\frac{dy_n}{da_n}\frac{da_n}{d\mathbf{w}} && \frac{d}{dx}\ln f(x) = \frac{f'(x)}{f(x)} \\
&= -\sum_{n=1}^N \left(\frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n}\right)y_n(1 - y_n)\boldsymbol{\phi}_n && (4.88) \\
&= -\sum_{n=1}^N \left(t_n(1 - y_n) - y_n(1 - t_n)\right)\boldsymbol{\phi}_n
= -\sum_{n=1}^N (t_n - y_n)\boldsymbol{\phi}_n
= \sum_{n=1}^N (y_n - t_n)\boldsymbol{\phi}_n
\end{aligned}
$$

Note that the gradient takes the same form as the gradient of the sum-of-squares error function; however, yn now involves a nonlinear function. At this point we can make use of (4.91) and (3.22) to obtain a sequential algorithm (gradient descent) for optimizing the parameters.
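As an illustration, here is a minimal stochastic gradient descent sketch of this update, assuming the feature matrix already contains a bias column (a sketch, not the prml implementation used below):
# logistic regression trained with stochastic gradient descent on the cross-entropy error
def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def fit_logistic_sgd(phi, t, eta=0.01, epochs=100):
    w = np.zeros(phi.shape[1])
    for _ in range(epochs):
        for phi_n, t_n in zip(phi, t):
            y_n = sigmoid(w @ phi_n)
            w -= eta * (y_n - t_n) * phi_n  # per-sample gradient of the cross-entropy
    return w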
# number of training points
N = 100
# number of outlier points
n_outliers = 5
x_train, t = make_classification(
n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, n_samples=N, random_state=12
)
x1, x2 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
x_test = np.array([x1, x2]).reshape(2, -1).T
outliers = np.random.random_sample((n_outliers, 2)) + 3
x_train_outliers = np.vstack((x_train, outliers))
t_outliers = np.hstack((t, np.ones(n_outliers, dtype=int)))
feature = LinearFeature()
x_train_linear = feature.transform(x_train)
x_train_linear_outliers = feature.transform(x_train_outliers)
x_test_linear = feature.transform(x_test)
model = LogisticRegression()
model.fit_lms(x_train_linear, t, 0.01)
predicted = model.predict(x_test_linear)
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("$x_1$", fontsize=12)
plt.ylabel("$x_2$", fontsize=12)
plt.title("Original decision boundary")
model.fit_lms(x_train_linear_outliers, t_outliers, 0.01)
predicted_outliers = model.predict(x_test_linear)
plt.subplot(1, 2, 2)
plt.scatter(x_train_outliers[:, 0], x_train_outliers[:, 1], c=t_outliers)
plt.contourf(x1, x2, predicted_outliers.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("$x_1$", fontsize=12)
plt.ylabel("$x_2$", fontsize=12)
plt.title("Outliers decision boundary")
plt.show()
Note that logistic regression is more robust to outliers than the least-squares discriminant presented in Section 4.1.
In linear regression models, the maximum likelihood solution, under the assumption of a Gaussian noise model, leads to a closed-form solution. For logistic regression, there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid function. However, the error function is still convex and can be minimized by an efficient iterative technique based on the Newton-Raphson optimization scheme. This algorithm uses a local quadratic approximation to the log-likelihood, and takes the form
$$\mathbf{w}^{\text{new}} = \mathbf{w}^{\text{old}} - \mathbf{H}^{-1}\nabla E(\mathbf{w})$$

where H is the Hessian matrix whose elements comprise the second derivatives of E(w) with respect to w.
Note that, if we apply the Newton-Raphson algorithm to the linear regression model, we derive the standard least squares solution (see 4.94 and 4.95).
Applying the Newton-Raphson update to the cross-entropy error function for the logistic regression model, we obtain,
$$\nabla E(\mathbf{w}) = \sum_{n=1}^N (y_n - t_n)\boldsymbol{\phi}_n = \boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$$

and

$$\mathbf{H} = \nabla\nabla E(\mathbf{w}) = \nabla\sum_{n=1}^N (y_n - t_n)\boldsymbol{\phi}_n = \nabla\sum_{n=1}^N y_n\boldsymbol{\phi}_n \overset{(4.88)}{=} \sum_{n=1}^N y_n(1 - y_n)\boldsymbol{\phi}_n\frac{d}{d\mathbf{w}}\mathbf{w}^T\boldsymbol{\phi}_n = \sum_{n=1}^N y_n(1 - y_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T = \boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi}$$

where R is a diagonal matrix whose elements are $R_{nn} = y_n(1 - y_n)$. Then, the update formula becomes,

$$\mathbf{w}^{\text{new}} = \mathbf{w}^{\text{old}} - (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$$

Note that the Hessian depends on w through the weighting matrix R, reflecting the fact that the error function is no longer quadratic. Thus, we must apply the update formula iteratively, each time using the new weight vector w to compute a revised weighting matrix R. For this reason, the algorithm is known as iterative reweighted least squares or IRLS.
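A compact IRLS sketch in NumPy, illustrating the update above (a simplified sketch, not the prml implementation):
# iterative reweighted least squares (IRLS) for logistic regression
def fit_logistic_irls(phi, t, n_iter=20):
    w = np.zeros(phi.shape[1])
    for _ in range(n_iter):
        y = 1 / (1 + np.exp(-phi @ w))
        r = y * (1 - y)                             # diagonal of the weighting matrix R
        hessian = (phi * r[:, None]).T @ phi        # Phi^T R Phi
        gradient = phi.T @ (y - t)                  # Phi^T (y - t)
        w = w - np.linalg.solve(hessian, gradient)  # Newton-Raphson update
    return w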
The elements of R can be interpreted as variances, given by,
$$\mathbb{E}[t] = \sum_{t \in \{0, 1\}} t\,p(t|\mathbf{x}) = \sigma(\mathbf{x})$$

and

$$\text{var}[t] = \mathbb{E}[t^2] - \mathbb{E}[t]^2 \overset{t^2 = t}{=} \mathbb{E}[t] - \mathbb{E}[t]^2 = \sigma(\mathbf{x}) - \sigma(\mathbf{x})^2 = y(1 - y)$$

# number of training points
N = 100
# number of outlier points
n_outliers = 5
x_train, t = make_classification(
n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, n_samples=N, random_state=12
)
x1, x2 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
x_test = np.array([x1, x2]).reshape(2, -1).T
outliers = np.random.random_sample((n_outliers, 2)) + 3
x_train_outliers = np.vstack((x_train, outliers))
t_outliers = np.hstack((t, np.ones(n_outliers, dtype=int)))
feature = LinearFeature()
x_train_linear = feature.transform(x_train)
x_train_linear_outliers = feature.transform(x_train_outliers)
x_test_linear = feature.transform(x_test)
model = LogisticRegression()
model.fit(x_train_linear, t)
predicted = model.predict(x_test_linear)
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("$x_1$", fontsize=12)
plt.ylabel("$x_2$", fontsize=12)
plt.title("Original decision boundary")
model.fit(x_train_linear_outliers, t_outliers)
predicted_outliers = model.predict(x_test_linear)
plt.subplot(1, 2, 2)
plt.scatter(x_train_outliers[:, 0], x_train_outliers[:, 1], c=t_outliers)
plt.contourf(x1, x2, predicted_outliers.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("$x_1$", fontsize=12)
plt.ylabel("$x_2$", fontsize=12)
plt.title("Outliers decision boundary")
plt.show()
The Newton-Raphson method requires computing the Hessian and solving a set of linear equations, so the computational cost per iteration is higher than that of gradient descent. However, it usually converges faster than gradient descent, in the sense that the number of iterations required is much smaller.
We have seen that for K > 2 classes, the posterior probabilities are given by a softmax transformation of linear functions of the feature variables. Here, we use maximum likelihood to determine the parameters wk of the model directly. To that end, we need to calculate the derivatives of yk (see 4.104) with respect to the activations aj (see 4.68 and 4.105).
In order to find the derivatives, we need to consider the cases k = j and k ≠ j separately. For k = j,

$$\frac{\partial y_k}{\partial a_k} = \frac{\partial}{\partial a_k}\frac{\exp(a_k)}{\sum_i \exp(a_i)} = \frac{\exp(a_k)\sum_i \exp(a_i) - \exp(a_k)\exp(a_k)}{\left(\sum_i \exp(a_i)\right)^2} = y_k - y_k^2 = y_k(1 - y_k) \tag{1}$$

whereas for k ≠ j,

$$\frac{\partial y_k}{\partial a_j} = \frac{\partial}{\partial a_j}\frac{\exp(a_k)}{\sum_i \exp(a_i)} = -\frac{\exp(a_k)\exp(a_j)}{\left(\sum_i \exp(a_i)\right)^2} = -y_k y_j \tag{2}$$

where we have used the quotient rule $\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$.
Combining (1) and (2), we obtain,

$$\frac{\partial y_k}{\partial a_j} = y_k(I_{kj} - y_j)$$

where Ikj are the elements of the identity matrix.
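This identity is easy to verify numerically with finite differences; a small sketch:
# numerical check of the softmax Jacobian dy_k/da_j = y_k * (I_kj - y_j)
a = np.array([0.5, -1.0, 2.0])
y = np.exp(a) / np.sum(np.exp(a))
analytic = np.diag(y) - np.outer(y, y)  # element (k, j) equals y_k * (I_kj - y_j)

eps = 1e-6
numeric = np.empty((a.size, a.size))
for j in range(a.size):
    a_eps = a.copy()
    a_eps[j] += eps
    y_eps = np.exp(a_eps) / np.sum(np.exp(a_eps))
    numeric[:, j] = (y_eps - y) / eps  # finite-difference column of the Jacobian

print(np.allclose(analytic, numeric, atol=1e-5))  # should print True (up to finite-difference error)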
Assuming a 1-of-K coding scheme, in which the target vector tn is a binary vector having all elements zero except for element k, which equals one, the likelihood function is given by,

$$p(\mathbf{T}|\mathbf{w}_1, \dots, \mathbf{w}_K) = \prod_{n=1}^N\prod_{k=1}^K p(\mathcal{C}_k|\boldsymbol{\phi}_n)^{t_{nk}} = \prod_{n=1}^N\prod_{k=1}^K y_k(\boldsymbol{\phi}_n)^{t_{nk}}$$

where T is an N × K matrix of target variables with elements tnk. Taking the negative logarithm then gives,

$$E(\mathbf{w}_1, \dots, \mathbf{w}_K) = -\ln p(\mathbf{T}|\mathbf{w}_1, \dots, \mathbf{w}_K) = -\sum_{n=1}^N\sum_{k=1}^K t_{nk}\ln y_k(\boldsymbol{\phi}_n)$$

which is the cross-entropy error function for the multiclass classification problem.
Taking the gradient of the error function with respect to the parameter vector wj, we obtain

$$
\begin{aligned}
\nabla_{\mathbf{w}_j}E(\mathbf{w}_1, \dots, \mathbf{w}_K) &= -\nabla_{\mathbf{w}_j}\sum_{n=1}^N\sum_{k=1}^K t_{nk}\ln y_k(\boldsymbol{\phi}_n)
= -\sum_{n=1}^N\sum_{k=1}^K t_{nk}\frac{1}{y_{nk}}\,y_{nk}(I_{kj} - y_{nj})\boldsymbol{\phi}_n \\
&= -\sum_{n=1}^N\sum_{k=1}^K t_{nk}(I_{kj} - y_{nj})\boldsymbol{\phi}_n
= \sum_{n=1}^N\sum_{k=1}^K t_{nk}y_{nj}\boldsymbol{\phi}_n - \sum_{n=1}^N\sum_{k=1}^K t_{nk}I_{kj}\boldsymbol{\phi}_n \\
&= \sum_{n=1}^N\sum_{k=1}^K t_{nk}y_{nj}\boldsymbol{\phi}_n - \sum_{n=1}^N t_{nj}\boldsymbol{\phi}_n \\
&= \sum_{n=1}^N y_{nj}\boldsymbol{\phi}_n - \sum_{n=1}^N t_{nj}\boldsymbol{\phi}_n && \textstyle\sum_k t_{nk} = 1 \\
&= \sum_{n=1}^N (y_{nj} - t_{nj})\boldsymbol{\phi}_n
\end{aligned}
$$

The Newton-Raphson update formula requires the evaluation of the Hessian matrix, given by

$$
\begin{aligned}
\nabla_{\mathbf{w}_k}\nabla_{\mathbf{w}_j}E(\mathbf{w}_1, \dots, \mathbf{w}_K) &= -\nabla_{\mathbf{w}_k}\nabla_{\mathbf{w}_j}\ln p(\mathbf{T}|\mathbf{w}_1, \dots, \mathbf{w}_K)
= \nabla_{\mathbf{w}_k}\sum_{n=1}^N (y_{nj} - t_{nj})\boldsymbol{\phi}_n
= \nabla_{\mathbf{w}_k}\sum_{n=1}^N y_{nj}\boldsymbol{\phi}_n \\
&= \sum_{n=1}^N \frac{\partial y_{nj}}{\partial a_{nk}}\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T
= \sum_{n=1}^N y_{nk}(I_{kj} - y_{nj})\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T
\end{aligned}
$$

Below we present a softmax regression example trained using gradient descent.
# number of training points
N = 100
x_train, t = make_classification(
n_features=2, n_informative=2, n_redundant=0, n_classes=3, n_clusters_per_class=1, n_samples=N, random_state=21
)
x1, x2 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
x_test = np.array([x1, x2]).reshape(2, -1).T
model = SoftmaxRegression()
model.fit(x_train, t)
predicted = model.predict(x_test)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 2, 4))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("$x_1$", fontsize=12)
plt.ylabel("$x_2$", fontsize=12)
plt.show()
In contrast to the Bayesian treatment of linear regression, in the Bayesian treatment of logistic regression we cannot integrate exactly over the parameter vector w, since the posterior distribution is no longer Gaussian. To that end, we may use the Laplace approximation, a widely used framework that aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. Consider a single continuous variable z, having a distribution p(z) defined by
$$p(z) = \frac{1}{Z}f(z)$$

where $Z = \int f(z)\,dz$ is the normalization coefficient. The goal is to find a Gaussian approximation q(z) centered on a mode of the distribution p(z). The first step is to find a mode of p(z), that is, a point z0 such that p′(z0) = 0, or equivalently

$$\left.\frac{df(z)}{dz}\right|_{z=z_0} = 0$$

Then, we use a second-order Taylor expansion to approximate g(z) = ln f(z) (because the logarithm of any Gaussian distribution is a quadratic function of the variables), centered on the mode z0, so that,

$$g(z) = \ln f(z) \approx \sum_{n=0}^{2}\frac{g^{(n)}(z_0)}{n!}(z - z_0)^n = g(z_0) + \frac{1}{2}g''(z_0)(z - z_0)^2$$

Note that the first-order term is omitted since z0 is a local maximum of the distribution and thus the first derivative is zero. Then, taking the exponential on both sides of the expansion, we obtain

$$f(z) \approx f(z_0)\exp\left\{-\frac{A}{2}(z - z_0)^2\right\}$$

where $A = -g''(z_0) = -\left.\frac{d^2}{dz^2}\ln f(z)\right|_{z=z_0}$. Therefore, using the standard result for the normalization of a Gaussian, the final normalized distribution q(z) has the form,

$$q(z) = \left(\frac{A}{2\pi}\right)^{1/2}\exp\left\{-\frac{A}{2}(z - z_0)^2\right\} = \mathcal{N}(z|z_0, A^{-1})$$

Note that the Gaussian approximation is well defined only when its precision A > 0, which implies that z0 must be a local maximum, not a minimum! In practice, a mode may be found by running some form of numerical optimization. The Laplace approximation is depicted in the next figure.
The same approximation can be applied to an M-dimensional space z.
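A small one-dimensional sketch of this recipe (the density f(z) below is just an illustrative example, and the mode is located by a simple grid search rather than a proper optimizer):
# Laplace approximation sketch for an (unnormalized) 1D density f(z)
def f(z):
    return np.exp(-z**2 / 2) / (1 + np.exp(-(20 * z + 4)))  # example non-Gaussian density

z_grid = np.linspace(-2, 4, 4001)
z0 = z_grid[np.argmax(f(z_grid))]  # mode found by a coarse grid search

h = 1e-4                           # numerical second derivative of ln f at the mode
A = -(np.log(f(z0 + h)) - 2 * np.log(f(z0)) + np.log(f(z0 - h))) / h**2

q = np.sqrt(A / (2 * np.pi)) * np.exp(-A / 2 * (z_grid - z0)**2)  # Gaussian approximation
plt.plot(z_grid, f(z_grid) / np.trapz(f(z_grid), z_grid), label="p(z)")
plt.plot(z_grid, q, "--", label="Laplace approximation q(z)")
plt.legend()
plt.show()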
Exact Bayesian inference for logistic regression, as well as the evaluation of the predictive distribution, is intractable. Thus, here, we consider the application of the Laplace approximation to the problem of Bayesian logistic regression.
Similar to Bayesian linear regression, we seek a Gaussian representation for the posterior distribution of the parameters; thus, we use a Gaussian (conjugate) prior of the general form
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0, \mathbf{S}_0)$$

where m0, S0 are fixed hyperparameters. Then, the posterior over w is given by

$$p(\mathbf{w}|\mathbf{t}) \propto p(\mathbf{t}|\mathbf{w})p(\mathbf{w})$$

Taking the natural logarithm on both sides, and substituting for the prior and the likelihood, we obtain
$$\ln p(\mathbf{w}|\mathbf{t}) = \sum_{n=1}^N\left\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\right\} - \frac{1}{2}(\mathbf{w} - \mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w} - \mathbf{m}_0) + \text{const}$$

Then, to obtain the Gaussian approximation of the posterior, we first maximize the posterior to give the MAP solution wMAP, which corresponds to the mean of the approximating Gaussian. The covariance is then given by the inverse of the Hessian matrix of the negative log posterior,
$$\mathbf{S}_N^{-1} = -\nabla\nabla\ln p(\mathbf{w}|\mathbf{t}) = \mathbf{S}_0^{-1} + \sum_{n=1}^N y_n(1 - y_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T$$

where we make use of (4.97). Thus, the Gaussian approximation of the posterior takes the form,

$$q(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{w}_{\text{MAP}}, \mathbf{S}_N)$$

The predictive distribution for class C1, given a new feature vector $\boldsymbol{\phi}_{\text{unseen}}$, is obtained by marginalizing over the posterior distribution p(w|t), which is approximated by q(w), so that,
$$p(\mathcal{C}_1|\boldsymbol{\phi}_{\text{unseen}}, \mathbf{t}, \boldsymbol{\Phi}) = \int p(\mathcal{C}_1|\boldsymbol{\phi}_{\text{unseen}}, \mathbf{w})\,p(\mathbf{w}|\mathbf{t}, \boldsymbol{\Phi})\,d\mathbf{w} \approx \int \sigma(\mathbf{w}^T\boldsymbol{\phi}_{\text{unseen}})\,q(\mathbf{w})\,d\mathbf{w}$$

The evaluation of the above integral is fairly complex and involves a significant number of steps; for more details see the corresponding section in the book. The final approximate predictive distribution has the form,

$$p(\mathcal{C}_1|\boldsymbol{\phi}_{\text{unseen}}, \mathbf{t}) = \sigma\left(\kappa(\sigma_a^2)\,\mu_a\right)$$

where $\kappa(\sigma_a^2) = \left(1 + \pi\sigma_a^2/8\right)^{-1/2}$, $\mu_a = \mathbf{w}_{\text{MAP}}^T\boldsymbol{\phi}_{\text{unseen}}$, and $\sigma_a^2 = \boldsymbol{\phi}_{\text{unseen}}^T\mathbf{S}_N\boldsymbol{\phi}_{\text{unseen}}$.
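Given wMAP and SN, the approximate predictive probability is straightforward to evaluate; a minimal sketch (the function and argument names here are illustrative, not part of the prml API):
# approximate predictive distribution p(C1 | phi) for Bayesian logistic regression
def predict_proba_laplace(phi, w_map, s_n):
    mu_a = phi @ w_map                          # mean of the activation
    var_a = np.sum((phi @ s_n) * phi, axis=-1)  # sigma_a^2 = phi^T S_N phi, per row of phi
    kappa = 1 / np.sqrt(1 + np.pi * var_a / 8)
    return 1 / (1 + np.exp(-kappa * mu_a))      # sigma(kappa * mu_a)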
# number of training points
N = 100
x_train, t = make_classification(
n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, n_samples=N, random_state=21
)
x1, x2 = np.meshgrid(np.linspace(-5, 5, N), np.linspace(-5, 5, N))
x_test = np.array([x1, x2]).reshape(2, -1).T
model = BayesianLogisticRegression()
model.fit(x_train, t)
predicted = model.predict(x_test)
plt.scatter(x_train[:, 0], x_train[:, 1], c=t)
plt.contourf(x1, x2, predicted.reshape(N, N), alpha=0.2, levels=np.linspace(0, 1, 5))
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.colorbar()
plt.show()