import math
import numpy as np
import matplotlib.pyplot as plt
from prml.datasets import generate_toy_data
from prml.preprocessing import PolynomialFeature
from prml.linear.linear_regression import LinearRegression
from prml.linear.ridge_regression import RidgeRegression
from prml.distribution import Gaussian, MultivariateGaussian
# Set random seed to make deterministic
np.random.seed(0)
# Ignore zero divisions and computation involving NaN values.
np.seterr(divide="ignore", invalid="ignore")
# Enable higher resolution plots
%config InlineBackend.figure_format = 'retina'
For presentation purposes, consider a synthetically generated example dataset. The data were generated from the function $\sin(2\pi x)$ by adding random Gaussian noise with standard deviation $\sigma = 0.3$.
We generated $N = 10$ observations spaced uniformly in the range $[0, 1]$. These observations comprise the input data vector
$$\mathbf{x} = (x_1, \ldots, x_N)^T$$
For each generated observation $x_n$ we evaluated the function $\sin(2\pi x_n)$ and then added the random noise, capturing the real-life situation in which observations are corrupted by noise, to obtain the corresponding target vector
$$\mathbf{t} = (t_1, \ldots, t_N)^T$$
# Sine function
def sin(x: np.ndarray) -> np.ndarray:
return np.sin(2 * np.pi * x)
# Generate a train data set
x_train, y_train = generate_toy_data(sin, 10, 0.3)
# Generate a test data set
x_test = np.linspace(0, 1, 100)
y_test = sin(x_test)
plt.scatter(x_train, y_train, facecolor="none", edgecolor="b", s=50, label="Training data")
plt.plot(x_test, y_test, color="g", label="$\sin(2\pi x)$")
plt.xlabel("$x$", fontsize=14)
plt.ylabel("$y$", fontsize=14)
plt.legend(fontsize=14)
plt.show()
The generated training dataset of N=10 points is shown as blue circles, each comprising an observation of the input variable x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the data.
Our goal is to predict the value of $\hat{t}$ for some new value $\hat{x}$, in the absence of any knowledge of the green curve. To that end, we consider a simple approach based on curve fitting. In particular, we shall try to fit the data using a polynomial function of the form
$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$
where $M$ is the order of the polynomial. Functions such as $y(x, \mathbf{w})$ that are linear in the unknown coefficients $\mathbf{w}$ are called linear models.
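For concreteness, a polynomial of this form can be evaluated with a few lines of NumPy; the coefficient values below are arbitrary and serve only to illustrate the model.
# Evaluate y(x, w) = sum_j w_j x^j for an arbitrary (illustrative) coefficient vector w.
w = np.array([0.5, -1.0, 2.0])   # w_0, w_1, w_2, i.e. a polynomial of order M = 2
x_points = np.linspace(0, 1, 5)
y_values = sum(w_j * x_points ** j for j, w_j in enumerate(w))
print(y_values)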
Next, we need to determine the values of the coefficients w by fitting the polynomial to the training data. This can be done by minimizing an error function that measures the misfit between the function y(x,w), for a given value of w, and the training data points.
One simple error function is the sum of squares of the errors between the predictions $y(x_n, \mathbf{w})$ and the corresponding target values $t_n$
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\big(y(x_n, \mathbf{w}) - t_n\big)^2 \geq 0$$
where the error becomes zero if, and only if, the polynomial $y(x, \mathbf{w})$ passes exactly through every training data point.
We can solve the curve fitting problem by choosing the value of $\mathbf{w}$ for which $E(\mathbf{w})$ is as small as possible. Because the error function is quadratic, its derivatives are linear in $\mathbf{w}$, and so the minimization has a unique closed-form solution, denoted by $\mathbf{w}^\star$. To minimize the error function we derive the gradient vector, set it equal to zero, and solve for $\mathbf{w}^\star$ as follows,
$$\nabla E(\mathbf{w}^\star) = \mathbf{0}$$
First, we substitute the polynomial into the error function
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(\sum_{j=0}^{M} w_j x_n^j - t_n\right)^2$$
Note that each of the $N$ data points in the generated training set is one-dimensional, that is, $x \in \mathbb{R}$. However, the polynomial transforms each input $x$ into a vector of $M+1$ features $(1, x, x^2, \ldots, x^M)$. Thus, the training inputs can be arranged into an $N \times (M+1)$ design matrix $\mathbf{X}$, where $X_{nj} = x_n^j$, the $n$th input value raised to the power of $j$.
To find the gradient vector, we take the partial derivative of $E$ with respect to an arbitrary $w_k$. Differentiating the sum term by term, we get
$$
\begin{aligned}
\nabla E(\mathbf{w})_k = \frac{\partial E(\mathbf{w})}{\partial w_k} &= \frac{1}{2}\sum_{n=1}^{N} 2\left(\sum_{j=0}^{M} w_j x_n^j - t_n\right) x_n^k \\
&= \sum_{n=1}^{N}\left(\sum_{j=0}^{M} w_j x_n^j - t_n\right) x_n^k \\
&= \sum_{n=1}^{N} (\mathbf{X}\mathbf{w} - \mathbf{t})_n X_{nk} \\
&= \sum_{n=1}^{N} X^T_{kn} (\mathbf{X}\mathbf{w} - \mathbf{t})_n \\
&= \big(\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{t})\big)_k
\end{aligned}
$$
Using the partial derivative for one component, we obtain the full gradient vector by dropping the $k$ subscript. Thus, the minimizer $\mathbf{w}^\star$ must satisfy
$$\nabla E(\mathbf{w}^\star) = \mathbf{X}^T(\mathbf{X}\mathbf{w}^\star - \mathbf{t}) = \mathbf{0}$$
Solving for $\mathbf{w}^\star$ gives the unique solution of the curve fitting problem
$$\mathbf{X}^T(\mathbf{X}\mathbf{w}^\star - \mathbf{t}) = \mathbf{0} \;\Leftrightarrow\; \mathbf{X}^T\mathbf{X}\mathbf{w}^\star = \mathbf{X}^T\mathbf{t} \;\Leftrightarrow\; \mathbf{w}^\star = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t}$$
The resulting polynomial is given by the function $y(x, \mathbf{w}^\star)$.
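As a sanity check, the closed-form solution can be computed directly with plain NumPy. The sketch below builds the design matrix with np.vander and solves the normal equations; it is a minimal illustration of the formula above (with an arbitrary choice of $M$), not the implementation used by the prml package.
# Minimal illustration of the least-squares closed-form solution.
M = 3  # polynomial order (illustrative choice)
X = np.vander(x_train, M + 1, increasing=True)  # design matrix, X[n, j] = x_n^j
# Solve X^T X w = X^T t (np.linalg.solve is preferred over forming an explicit inverse)
w_star = np.linalg.solve(X.T @ X, X.T @ y_train)
print(w_star)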
There remains the problem of choosing the order M of the polynomial, which is an example of the important concept called model selection.
In order to study the effect of different M values, we plot the result of fitting polynomials having orders M=0,1,3,9 to the data set.
for i, degree in enumerate([0, 1, 3, 9]):
plt.subplot(2, 2, i + 1)
plt.tight_layout()
feature = PolynomialFeature(degree)
x_train_features = feature.transform(x_train)
x_test_features = feature.transform(x_test)
model = LinearRegression()
model.fit(x_train_features, y_train)
y, _ = model.predict(x_test_features)
plt.scatter(x_train, y_train, facecolor="none", edgecolor="b", s=50, label="Training data")
plt.plot(x_test, y_test, color="g", label="$\sin(2\pi x)$")
plt.plot(x_test, y, color="r", label="Fitting")
plt.ylim(-2, 2)
plt.title("$M={}$".format(degree), fontsize=14)
plt.legend(bbox_to_anchor=(1, 0.85), loc=2, borderaxespad=1, fontsize=14)
plt.show()
Note that the constant (M=0) and first order (M=1) polynomials give rather poor fits to the data. The third order (M=3) polynomial seems to give the best fit, while the higher order one (M=9) achieves an excellent fit to the data, that is, E(w⋆)=0. However, the fitted curve gives a poor representation of the underlying function sin(2πx). This phenomenon is known as over-fitting.
A more quantitative insight into the dependence of the generalization performance on $M$ can be obtained by using the root-mean-square (RMS) error, defined as
$$E_{RMS} = \sqrt{2 E(\mathbf{w}^\star) / N}$$
The RMS error on both the training and test data points for each value of $M$ is shown in the following figure:
def rms_error(a, b):
return np.sqrt(2 * np.mean(np.square(a - b)))
training_errors = []
test_errors = []
for i in range(10):
feature = PolynomialFeature(i)
x_train_features = feature.transform(x_train)
x_test_features = feature.transform(x_test)
model = LinearRegression()
model.fit(x_train_features, y_train)
y, _ = model.predict(x_test_features)
    training_errors.append(rms_error(model.predict(x_train_features)[0], y_train))
    test_errors.append(
        rms_error(model.predict(x_test_features)[0], y_test + np.random.normal(scale=0.3, size=len(y_test)))
    )
plt.plot(training_errors, "o-", mfc="none", mec="b", ms=10, c="b", label="Training error")
plt.plot(test_errors, "o-", mfc="none", mec="r", ms=10, c="r", label="Test error")
plt.xlabel("$M$", fontsize=14)
plt.ylabel("$E_{RMS}$", fontsize=14)
plt.legend(fontsize=14)
plt.show()
The test set error is measuring how well we are doing in predicting the values of t for new data observations of x. For M=9, the training set error goes to zero, because the polynomial contains 10 degrees of freedom and so it can be tuned exactly to the 10 data points in the training set.
It is also interesting to examine the behavior of the model as the size of the data increases. The following figure depicts the result of fitting the M=9 polynomial for N=15 and N=100 data points.
for i, size in enumerate([15, 100]):
plt.subplot(1, 2, i + 1)
# Generate a train set
x_train_100, y_train_100 = generate_toy_data(sin, size, 0.3)
# Generate a test set
x_test_100 = np.linspace(0, 1, 100)
    y_test_100 = sin(x_test_100)
feature = PolynomialFeature(9)
x_train_100_features = feature.transform(x_train_100)
x_test_100_features = feature.transform(x_test_100)
model = LinearRegression()
model.fit(x_train_100_features, y_train_100)
y, _ = model.predict(x_test_100_features)
plt.scatter(x_train_100, y_train_100, facecolor="none", edgecolor="b", s=50, label="Training data")
plt.plot(x_test_100, y_test_100, color="g", label="$\sin(2\pi x)$")
plt.plot(x_test_100, y, color="r", label="Fitting")
plt.ylim(-1.5, 1.5)
plt.title("$N={}$".format(size), fontsize=14)
plt.legend(bbox_to_anchor=(1, 0.64), loc=2, borderaxespad=1, fontsize=14)
plt.show()
Note that the over-fitting problem becomes less severe as the size of the data set increases. In other words, the larger the data set, the more complex the model that we can afford to fit to the data.
One technique that is often used to control the over-fitting phenomenon is that of regularization, which adds a penalty term to the error function in order to discourage the coefficients from reaching large values. The simplest such penalty term is the sum of squares of all of the coefficients, leading to a modified error function of the following form
$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\big(y(x_n, \mathbf{w}) - t_n\big)^2 + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$
Such techniques are known as shrinkage methods because they reduce the value of the coefficients. The particular case of a quadratic regularizer is called ridge regression. In the context of neural networks, this approach is known as weight decay.
Similar to the previous case, the ridge error function can be minimized exactly in closed form as follows,
$$
\begin{aligned}
\nabla \tilde{E}(\mathbf{w})_k = \frac{\partial \tilde{E}(\mathbf{w})}{\partial w_k} &= \frac{1}{2}\sum_{n=1}^{N} 2\left(\sum_{j=0}^{M} w_j x_n^j - t_n\right) x_n^k + \frac{1}{2}\lambda\, 2 w_k \\
&= \sum_{n=1}^{N}\left(\sum_{j=0}^{M} w_j x_n^j - t_n\right) x_n^k + \lambda w_k \\
&= \sum_{n=1}^{N} (\mathbf{X}\mathbf{w} - \mathbf{t})_n X_{nk} + \lambda w_k \\
&= \sum_{n=1}^{N} X^T_{kn} (\mathbf{X}\mathbf{w} - \mathbf{t})_n + \lambda w_k \\
&= \big(\mathbf{X}^T(\mathbf{X}\mathbf{w} - \mathbf{t})\big)_k + \lambda w_k
\end{aligned}
$$
Using the partial derivative for one component, we obtain the full gradient vector by dropping the $k$ subscript. Then, the minimizer $\mathbf{w}^\star$ must satisfy
$$\nabla \tilde{E}(\mathbf{w}^\star) = \mathbf{X}^T(\mathbf{X}\mathbf{w}^\star - \mathbf{t}) + \lambda\mathbf{w}^\star = \mathbf{0}$$
Solving for $\mathbf{w}^\star$ gives the unique solution that minimizes the ridge error
$$\mathbf{X}^T(\mathbf{X}\mathbf{w}^\star - \mathbf{t}) + \lambda\mathbf{w}^\star = \mathbf{0} \;\Leftrightarrow\; \mathbf{X}^T\mathbf{X}\mathbf{w}^\star - \mathbf{X}^T\mathbf{t} + \lambda\mathbf{w}^\star = \mathbf{0} \;\Leftrightarrow\; (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{w}^\star = \mathbf{X}^T\mathbf{t} \;\Leftrightarrow\; \mathbf{w}^\star = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{t}$$
The following figures depict the results of fitting the polynomial of order $M=9$ to the same data set as before, this time using the regularized error function. We see that, for a value of $\ln\lambda = -18$, the over-fitting has been suppressed and we obtain a much closer representation of the underlying function $\sin(2\pi x)$. If, however, we use too large a value for $\lambda$, then we again obtain a poor fit.
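The regularized solution can likewise be checked with plain NumPy before using the RidgeRegression model below. This sketch only illustrates the formula (with $M=9$ and $\ln\lambda=-18$ to match the figure); it is not the prml implementation.
# Minimal illustration of the ridge closed-form solution.
M, ln_lambda = 9, -18
X = np.vander(x_train, M + 1, increasing=True)   # design matrix, X[n, j] = x_n^j
lam = math.exp(ln_lambda)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M + 1), X.T @ y_train)
print(w_ridge)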
feature = PolynomialFeature(9)
x_train_features = feature.transform(x_train)
x_test_features = feature.transform(x_test)
for i, ln_lambda in enumerate([float("-inf"), -18, 0]):
plt.subplot(1, 3, i + 1)
plt.tight_layout()
model = RidgeRegression(alpha=math.exp(ln_lambda))
model.fit(x_train_features, y_train)
y, _ = model.predict(x_test_features)
plt.scatter(x_train, y_train, facecolor="none", edgecolor="b", s=50, label="Training data")
plt.plot(x_test, y_test, color="g", label="$\sin(2\pi x)$")
plt.plot(x_test, y, color="r", label="Fitting")
plt.ylim(-2, 2)
plt.title("$\ln\lambda={}$".format(ln_lambda), fontsize=14)
plt.legend(bbox_to_anchor=(1, 0.65), loc=2, borderaxespad=1, fontsize=14)
plt.show()
A sample of coefficients from the fitted polynomials is presented in the table below, showing that regularization has the desired effect of reducing the magnitude of the coefficients.
| | $\ln\lambda=-\infty$ | $\ln\lambda=-18$ | $\ln\lambda=0$ |
|---|---|---|---|
| $w_0^\star$ | 0.39 | 0.38 | 0.36 |
| $w_1^\star$ | −135.04 | −2.14 | −0.46 |
| $w_2^\star$ | 3206.76 | 81.88 | −0.39 |
| $w_3^\star$ | −29215.92 | −390.51 | −0.22 |
| $w_4^\star$ | 139594.34 | 578.12 | −0.05 |
| $w_5^\star$ | −388863.80 | −31.48 | 0.07 |
| $w_6^\star$ | 652373.24 | −49.12 | 0.17 |
| $w_7^\star$ | −648124.69 | −28.57 | 0.25 |
| $w_8^\star$ | 350721.94 | 540.17 | 0.31 |
| $w_9^\star$ | −79556.29 | −255.65 | 0.35 |
The impact of the regularization term on the generalization error can be seen by plotting the value of the RMS error for both training and test sets against lnλ, as shown in the next figure. We see that λ controls the effective complexity of the model and hence determines the degree of over-fitting.
training_errors = []
test_errors = []
feature = PolynomialFeature(9)
x_train_features = feature.transform(x_train)
x_test_features = feature.transform(x_test)
ln_lambda_values = range(-40, -20, 1)
for ln_lambda in ln_lambda_values:
model = RidgeRegression(alpha=math.exp(ln_lambda))
model.fit(x_train_features, y_train)
training_errors.append(rms_error(model.predict(x_train_features)[0], y_train))
test_errors.append(
rms_error(model.predict(x_test_features)[0], y_test + np.random.normal(scale=0.3, size=len(y_test)))
)
plt.plot(ln_lambda_values, training_errors, "o-", mfc="none", mec="b", ms=10, c="b", label="Training error")
plt.plot(ln_lambda_values, test_errors, "o-", mfc="none", mec="r", ms=10, c="r", label="Test error")
plt.xlabel("$\ln\lambda$", fontsize=14)
plt.ylabel("$E_{RMS}$", fontsize=14)
plt.legend(fontsize=12)
plt.show()
The probability p(A) of an event A is always a non-negative number, i.e.,
$$p(A) \geq 0$$
The probability $p(B)$ of an event $B$ that is certain to occur is equal to one, i.e.,
$$p(B) = 1$$
If two events $A$ and $B$ are mutually exclusive, that is, they cannot occur simultaneously so that $p(A \cap B) = 0$, then the probability that either $A$ or $B$ occurs, denoted $p(A \cup B)$, is given by
$$p(A \cup B) = p(A) + p(B) - p(A \cap B) = p(A) + p(B)$$
Consider the slightly more general setting involving two random variables $X$ and $Y$ instead of just two events (which are essentially binary random variables). Suppose that $X$ can take any of the values $x_i$ and $Y$ can take the values $y_j$. Moreover, consider a total of $N$ trials, and let $n_{ij}$ be the number of trials in which $X = x_i$ and $Y = y_j$. Also, let $c_i$ denote the number of trials in which $X$ takes the value $x_i$ (irrespective of the value of $Y$), and similarly let $r_j$ denote the number of trials in which $Y$ takes the value $y_j$.
Then the marginal, conditional and joint probabilities are given by
$$p(X = x_i) = \frac{c_i}{N} \qquad p(Y = y_j) = \frac{r_j}{N} \qquad p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} \qquad p(Y = y_j \,|\, X = x_i) = \frac{n_{ij}}{c_i}$$
Note that the joint probability $p(X = x_i, Y = y_j)$ is shorthand for $p(X = x_i \cap Y = y_j)$.
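These definitions are easy to verify numerically. The sketch below uses a small, made-up table of counts $n_{ij}$ (purely illustrative) and computes the marginal, joint, and conditional probabilities from it.
# Illustrative count table: n[i, j] = number of trials with X = x_i and Y = y_j.
n = np.array([[30, 10],
              [20, 40]])
N_trials = n.sum()
p_joint = n / N_trials                           # p(X = x_i, Y = y_j) = n_ij / N
p_x = n.sum(axis=1) / N_trials                   # marginal p(X = x_i) = c_i / N
p_y = n.sum(axis=0) / N_trials                   # marginal p(Y = y_j) = r_j / N
p_y_given_x = n / n.sum(axis=1, keepdims=True)   # conditional p(Y = y_j | X = x_i) = n_ij / c_i
print(p_joint, p_x, p_y, p_y_given_x, sep="\n")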
The sum rule, $p(X = x_i) = \sum_j p(X = x_i, Y = y_j)$, follows directly from $c_i = \sum_j n_{ij}$; applying it is called "marginalizing out $Y$".
Similarly, the product rule, $p(X = x_i, Y = y_j) = p(Y = y_j | X = x_i)\,p(X = x_i)$, follows from $\frac{n_{ij}}{N} = \frac{n_{ij}}{c_i}\frac{c_i}{N}$. Computing $p(Y|X)$ is called "conditioning on $X$". The product rule generalizes to more variables as follows
$$p(X_1, X_2, \ldots, X_K) = p(X_K | X_{K-1}, \ldots, X_1)\,p(X_{K-1}, \ldots, X_1) = \cdots = p(X_K | X_{K-1}, \ldots, X_1)\cdots p(X_2|X_1)\,p(X_1)$$
Note that if the joint distribution of two random variables factorizes into the product of their marginals, so that $p(X, Y) = p(X)\,p(Y)$, then $X$ and $Y$ are said to be statistically independent. In that case, the product rule gives $p(Y|X) = p(Y)$.
From the product rule, together with the symmetry property $p(X, Y) = p(Y, X)$, we immediately obtain Bayes' theorem as follows
$$p(X, Y) = p(Y|X)\,p(X) \;\Leftrightarrow\; p(Y|X) = \frac{p(X, Y)}{p(X)} \;\Leftrightarrow\; p(Y|X) = \frac{p(Y, X)}{p(X)} \;\Leftrightarrow\; p(Y|X) = \frac{p(X|Y)\,p(Y)}{p(X)}$$
Using the sum and product rules, the marginal probability $p(X)$ in the denominator can be expressed in terms of the quantities appearing in the numerator
$$p(Y|X) = \frac{p(X|Y)\,p(Y)}{\sum_Y p(X, Y)} = \frac{p(X|Y)\,p(Y)}{\sum_Y p(X|Y)\,p(Y)}$$
An interpretation of Bayes' theorem is the following: if we were asked for the most probable value of $Y$ before observing any value of $X$, then the most complete information available is the prior probability $p(Y)$. After we observe the value of $X$, we can use Bayes' theorem to compute the posterior probability $p(Y|X)$, which represents our updated knowledge after incorporating the evidence provided by the observed data.
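As a small numerical illustration (the prior and likelihood values are invented for this example), the posterior over a two-valued variable $Y$ after observing a particular value of $X$ follows directly from the formula above.
# Illustrative Bayes' theorem computation with made-up numbers.
prior_y = np.array([0.4, 0.6])                 # p(Y = y_j), prior over two hypotheses
likelihood_x_given_y = np.array([0.8, 0.3])    # p(X = x | Y = y_j) for one observed value x
evidence = np.sum(likelihood_x_given_y * prior_y)           # p(X = x), via the sum rule
posterior_y = likelihood_x_given_y * prior_y / evidence     # p(Y = y_j | X = x)
print(posterior_y, posterior_y.sum())          # the posterior sums to one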
Let $\mathbf{w}$ denote the model parameters and $\mathcal{D}$ the data. Bayes' theorem is then given by
$$p(\mathbf{w}|\mathcal{D}) = \frac{p(\mathcal{D}|\mathbf{w})\,p(\mathbf{w})}{p(\mathcal{D})} \quad\Leftrightarrow\quad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
The frequentist paradigm generally quantifies the properties of data-driven quantities in light of fixed model parameters, while the Bayesian paradigm quantifies the properties of unknown model parameters in light of the observed data.
If the probability of a real-valued variable $x$ falling in the interval $(u, u + \delta)$ is given by $p_x(u)\delta$ as $\delta \to 0$, then $p_x(x)$ is called the probability density function over $x$.
$$p(u \leq x \leq u + \delta) = \int_u^{u+\delta} p_x(x)\,dx = P_x(u + \delta) - P_x(u) = \frac{P_x(u + \delta) - P_x(u)}{\delta}\,\delta$$
Then we have that
$$\lim_{\delta \to 0} \frac{P_x(u + \delta) - P_x(u)}{\delta}\,\delta = \frac{dP_x(u)}{dx}\,\delta = p_x(u)\,\delta$$
Therefore, the probability that $x$ lies in an interval $(a, b)$ is given by
$$p(a \leq x \leq b) = \int_a^b p_x(x)\,dx$$
and the density must satisfy the following two conditions
$$p_x(x) \geq 0 \qquad \int_{-\infty}^{\infty} p_x(x)\,dx = 1$$
The probability that $x$ lies in the interval $(-\infty, z)$ is given by the cumulative distribution function (CDF)
$$P_x(z) = \int_{-\infty}^{z} p_x(x)\,dx$$
where
$$p_x(x) = \frac{dP_x(x)}{dx}$$
If $x$ is a discrete variable, then $p_x(x)$ is sometimes called a probability mass function (PMF) because it can be regarded as a set of probability masses concentrated at the allowed values of $x$.
The average value of some function f(x) under a probability distribution px(x) is called the expectation of f(x) and is denoted by E[f]. The average is weighted by the relative probabilities of the different values of x as follows
$$\mathbb{E}[f] = \sum_x p_x(x)\,f(x) \qquad \text{or} \qquad \mathbb{E}[f] = \int p_x(x)\,f(x)\,dx$$
for a discrete and a continuous variable, respectively. Given a finite number $N$ of points drawn from the probability distribution, the expectation can be approximated by a finite sum over these points
$$\mathbb{E}[f] \approx \frac{1}{N}\sum_{n=1}^{N} f(x_n)$$
Note that for expectations of functions of several variables, a subscript may be used to indicate which variable is being averaged over, e.g., $\mathbb{E}_x[f(x, y)]$.
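This sample-based approximation is straightforward to check numerically. The sketch below approximates $\mathbb{E}[x^2]$ under a standard Gaussian, whose exact value is 1; the sample size is an arbitrary illustrative choice.
# Monte Carlo approximation of E[f] with f(x) = x^2 under N(x|0, 1).
samples = np.random.normal(loc=0.0, scale=1.0, size=10000)
approx_expectation = np.mean(samples ** 2)
print(approx_expectation)  # should be close to the exact value E[x^2] = 1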
Whereas the expectation provides a measure of centrality, the variance of a random variable quantifies the spread of that random variable's distribution. Thus, the variance provides a measure of how much variability there is in f(x) around its mean value E[f(x)] and is defined as follows
$$
\begin{aligned}
\text{var}[f] &= \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big] \\
&= \mathbb{E}\big[f(x)^2 - 2 f(x)\mathbb{E}[f(x)] + \mathbb{E}[f(x)]^2\big] \\
&= \mathbb{E}[f(x)^2] - \mathbb{E}\big[2 f(x)\mathbb{E}[f(x)]\big] + \mathbb{E}\big[\mathbb{E}[f(x)]^2\big] \\
&= \mathbb{E}[f(x)^2] - 2\mathbb{E}[f(x)]\mathbb{E}[f(x)] + \mathbb{E}[f(x)]^2 \\
&= \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2
\end{aligned}
$$
The covariance expresses the extent to which $x$ and $y$ vary together and is given by
$$\text{cov}[x, y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\mathbb{E}[y]$$
A covariance matrix $\mathbf{\Sigma}$ has entries $\sigma_{ij}$ corresponding to the covariance of variables $i$ and $j$. If two variables are independent, their covariance vanishes, so the covariance matrix of a set of mutually independent variables is diagonal.
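As a quick numerical check on correlated synthetic data (illustrative settings), the sample estimates below should agree with np.var and np.cov.
# Estimate variance and covariance from samples of two correlated variables.
x_samples = np.random.normal(size=5000)
y_samples = 0.5 * x_samples + np.random.normal(scale=0.1, size=5000)
var_x = np.mean(x_samples ** 2) - np.mean(x_samples) ** 2                         # E[x^2] - E[x]^2
cov_xy = np.mean(x_samples * y_samples) - np.mean(x_samples) * np.mean(y_samples)  # E[xy] - E[x]E[y]
print(var_x, cov_xy)   # compare with np.var(x_samples) and np.cov(x_samples, y_samples)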
The Normal or Gaussian distribution is one of the most important probability distributions for continuous variables. For the case of a single real-valued variable x, the distribution is defined as follows
$$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}$$
and is governed by the parameters $\mu$, called the mean, and $\sigma^2$, called the variance. The square root of the variance, $\sigma$, is called the standard deviation. An alternative way to parameterize a Gaussian distribution is in terms of the precision $\beta = 1/\sigma^2$, giving
$$\mathcal{N}(x|\mu, \beta^{-1}) = \frac{\beta^{1/2}}{\sqrt{2\pi}} \exp\left\{-\frac{\beta}{2}(x - \mu)^2\right\}$$
x_space = np.linspace(-5, 5, 1000)
for mean, var, c in [(0, 1, "g"), (1, 2, "r"), (-1, 3, "b")]:
N_distribution = Gaussian(mean, var)
y = [N_distribution.pdf(x) for x in x_space]
plt.plot(x_space, y, color=c, label="$\mathcal{N}$" + "$(x|{},{})$".format(mean, var))
plt.xlabel("$x$", fontsize=14)
plt.ylabel("$\mathcal{N}(x|\mu,\sigma^2)$", fontsize=14)
plt.legend(fontsize=12)
plt.show()
Note that the probability density function is not an actual probability; it can therefore take values greater than one, $\mathcal{N}(x|\mu, \sigma^2) > 1$. We can see that the Gaussian distribution satisfies
$$\mathcal{N}(x|\mu, \sigma^2) > 0 \qquad \int_{-\infty}^{\infty} \mathcal{N}(x|\mu, \sigma^2)\,dx = 1$$
The Gaussian distribution can also be defined over a $D$-dimensional vector $\mathbf{x}$ of continuous variables as follows:
$$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \mathbf{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\mathbf{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right\}$$
where the $D$-dimensional vector $\boldsymbol{\mu}$ holds the mean of each dimension, while the $D \times D$ matrix $\mathbf{\Sigma}$ is the covariance.
mean = np.array([0, 0])
sigma = np.array([[1.0, 0.92], [0.92, 2.0]])
N_distribution = MultivariateGaussian(mean, sigma)
N = 100
x1, x2 = np.meshgrid(np.linspace(-5, 5, N), np.linspace(-5, 5, N))
p = np.zeros((N, N))
for i in range(N):
for j in range(N):
p[i, j] = N_distribution.pdf(np.array([x1[i, j], x2[i, j]]))
cp = plt.contourf(x1, x2, p, cmap="rainbow")
plt.colorbar(cp)
plt.xlabel("$x_1$", fontsize=14)
plt.ylabel("$x_2$", fontsize=14)
plt.axis([-3, 3, -3, 3])
plt.show()
Suppose we have a data set of observations $\mathbf{x} = (x_1, \ldots, x_N)^T$ drawn independently from a Gaussian distribution. Data points that are drawn independently from the same distribution are said to be independent and identically distributed (i.i.d.).
N_distribution = Gaussian(0, 1)
x_sample_data = N_distribution.draw(5)
y_sample_data = [N_distribution.pdf(x) for x in x_sample_data]
x_space = np.linspace(-5, 5, 100)
y = [N_distribution.pdf(x) for x in x_space]
plt.scatter(x_sample_data, y_sample_data, label="Training data")
plt.plot(x_space, y, color="g", label="$\mathcal{N}(x|0,1)$")
plt.legend(fontsize=12)
plt.show()
Because the data are independent, the likelihood function of the univariate Gaussian is as follows,
$$p(\mathbf{x}|\mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n|\mu, \sigma^2)$$
which corresponds to the product of the density values at the blue points in the figure above. Therefore, one way to determine the unknown parameters $\mu$ and $\sigma^2$ from the observed data set is to find the parameter values that maximize the likelihood function.
The log likelihood function can be written as follows
$$
\begin{aligned}
\ln p(\mathbf{x}|\mu, \sigma^2) &= \ln\left[\prod_{n=1}^{N} \mathcal{N}(x_n|\mu, \sigma^2)\right] = \sum_{n=1}^{N} \ln \mathcal{N}(x_n|\mu, \sigma^2) \\
&= \sum_{n=1}^{N} \ln\left(\frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x_n - \mu)^2\right\}\right) \\
&= \sum_{n=1}^{N} \ln\frac{1}{(2\pi\sigma^2)^{1/2}} + \sum_{n=1}^{N} \ln\left(\exp\left\{-\frac{1}{2\sigma^2}(x_n - \mu)^2\right\}\right) \\
&= N\ln\frac{1}{(2\pi\sigma^2)^{1/2}} - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 \\
&= N\ln 1 - N\ln(2\pi\sigma^2)^{1/2} - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 \\
&= -N\ln(2\pi\sigma^2)^{1/2} - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 \qquad \text{(since } \ln 1 = 0\text{)} \\
&= -\frac{N}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 \qquad \text{(since } \ln x^y = y\ln x\text{)} \\
&= -\frac{N}{2}\ln 2\pi - \frac{N}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2
\end{aligned}
$$
In the machine learning literature, the negative log of the likelihood function is called an error function. Since the negative logarithm is a monotonically decreasing function, maximizing the likelihood is equivalent to minimizing the error. The following figure depicts the likelihood as a function of $\mu$ and $\sigma^2$.
mu_space = np.linspace(-1, 1, 100)
var_space = np.linspace(0, 2, 100)
mu_mesh, var_mesh = np.meshgrid(mu_space, var_space)
ll = np.zeros((mu_space.size, var_space.size))
for i, mu in enumerate(mu_space):
for j, var in enumerate(var_space):
ll[i, j] = Gaussian(mu, var).likelihood_iid(x_sample_data)
cp = plt.contourf(mu_mesh, var_mesh, ll.T, cmap="rainbow")
plt.colorbar(cp)
plt.xlim(-1, 1)
plt.ylim(0, 2)
plt.xlabel("$\mu$", fontsize=14)
plt.ylabel("$\sigma^2$", fontsize=14)
plt.show()
Then, by maximizing the log likelihood with respect to $\mu$, we obtain the following solution
$$
\begin{aligned}
\frac{\partial \ln p(\mathbf{x}|\mu, \sigma^2)}{\partial \mu} = 0
&\Leftrightarrow \frac{\partial}{\partial \mu}\left(-\frac{N}{2}\ln 2\pi - \frac{N}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right) = 0 \\
&\Leftrightarrow \frac{\partial}{\partial \mu}\left(\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right) = 0 \\
&\Leftrightarrow \frac{\partial}{\partial \mu}\sum_{n=1}^{N}(x_n - \mu)^2 = 0 \\
&\Leftrightarrow \sum_{n=1}^{N}(2\mu - 2x_n) = 0 \\
&\Leftrightarrow 2\sum_{n=1}^{N}(\mu - x_n) = 0 \\
&\Leftrightarrow N\mu - \sum_{n=1}^{N} x_n = 0 \\
&\Leftrightarrow N\mu = \sum_{n=1}^{N} x_n \\
&\Leftrightarrow \mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n
\end{aligned}
$$
which is the sample mean of the observed values $x_n$. In a similar manner, maximizing with respect to $\sigma^2$, we obtain the maximum likelihood solution for the variance as follows
$$
\begin{aligned}
\frac{\partial \ln p(\mathbf{x}|\mu, \sigma^2)}{\partial \sigma^2} = 0
&\Leftrightarrow \frac{\partial}{\partial \sigma^2}\left(-\frac{N}{2}\ln 2\pi - \frac{N}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right) = 0 \\
&\Leftrightarrow \frac{\partial}{\partial \sigma^2}\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2\right) + \frac{\partial}{\partial \sigma^2}\left(-\frac{N}{2}\ln\sigma^2\right) = 0 \\
&\Leftrightarrow \frac{\partial}{\partial \sigma^2}\left(-(2\sigma^2)^{-1}\sum_{n=1}^{N}(x_n - \mu)^2\right) - \frac{N}{2\sigma^2} = 0 \\
&\Leftrightarrow \frac{1}{2\sigma^4}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2\sigma^2} = 0 \\
&\Leftrightarrow \frac{1}{2\sigma^4}\sum_{n=1}^{N}(x_n - \mu)^2 = \frac{N}{2\sigma^2} \\
&\Leftrightarrow \sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{ML})^2
\end{aligned}
$$
which is the sample variance measured with respect to the sample mean $\mu_{ML}$.
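These maximum likelihood estimates are exactly the sample mean and the (biased) sample variance, which is easy to confirm on the toy sample drawn above.
# Maximum likelihood estimates computed directly from the sample drawn earlier.
mu_ml = np.mean(x_sample_data)                   # (1/N) sum_n x_n
var_ml = np.mean((x_sample_data - mu_ml) ** 2)   # (1/N) sum_n (x_n - mu_ML)^2
print(mu_ml, var_ml)   # var_ml matches np.var(x_sample_data), the biased estimator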
A problem that arises in the context of our solutions for the maximum likelihood approach is that it systematically underestimates the variance of the Gaussian distribution. This is an example of a phenomenon called bias and is related to the problem of over-fitting encountered in the context of polynomial curve fitting. First note that the maximum likelihood solutions μML and σML are functions of the data set values xn. Now, consider the expectations of these quantities with respect to the data set values, which themselves come from a Gaussian distribution with parameters μ and σ2, given by
$$\mathbb{E}[\mu_{ML}] = \mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N} x_n\right] = \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}[x_n] = \frac{1}{N}N\mu = \mu$$
and
$$
\begin{aligned}
\mathbb{E}[\sigma^2_{ML}] &= \mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{ML})^2\right] = \frac{1}{N}\mathbb{E}\left[\sum_{n=1}^{N}(x_n - \mu_{ML})^2\right] \\
&= \frac{1}{N}\mathbb{E}\left[\sum_{n=1}^{N}\big(x_n^2 - 2 x_n\mu_{ML} + \mu_{ML}^2\big)\right] \\
&= \frac{1}{N}\left(\mathbb{E}\left[\sum_{n=1}^{N} x_n^2\right] - \mathbb{E}\left[\sum_{n=1}^{N} 2 x_n\mu_{ML}\right] + \mathbb{E}\left[\sum_{n=1}^{N}\mu_{ML}^2\right]\right) \\
&= \mu^2 + \sigma^2 - \frac{2}{N}\mathbb{E}\left[\sum_{n=1}^{N} x_n\mu_{ML}\right] + \mathbb{E}\left[\mu_{ML}^2\right] \qquad \text{(using } \mathbb{E}[x_n^2] = \mu^2 + \sigma^2\text{)} \\
&= \mu^2 + \sigma^2 - \frac{2}{N}\mathbb{E}\left[\sum_{n=1}^{N} x_n\left(\frac{1}{N}\sum_{m=1}^{N} x_m\right)\right] + \mathbb{E}\left[\left(\frac{1}{N}\sum_{n=1}^{N} x_n\right)^2\right] \qquad \text{(substituting } \mu_{ML}\text{)} \\
&= \mu^2 + \sigma^2 - \frac{2}{N^2}\mathbb{E}\left[\left(\sum_{n=1}^{N} x_n\right)^2\right] + \frac{1}{N^2}\mathbb{E}\left[\left(\sum_{n=1}^{N} x_n\right)^2\right] \\
&= \mu^2 + \sigma^2 - \frac{1}{N^2}\mathbb{E}\left[\left(\sum_{n=1}^{N} x_n\right)^2\right] \\
&= \mu^2 + \sigma^2 - \frac{1}{N^2}\big(N^2\mu^2 + N\sigma^2\big) \\
&= \mu^2 + \sigma^2 - \mu^2 - \frac{1}{N}\sigma^2 \\
&= \frac{N-1}{N}\sigma^2
\end{aligned}
$$
Therefore, on average the maximum likelihood estimate obtains the correct mean but underestimates the true variance by a factor $(N-1)/N$. The unbiased estimate of the variance is given by
$$\tilde{\sigma}^2 = \frac{N}{N-1}\sigma^2_{ML} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \mu_{ML})^2$$
N_distribution.ml(x_sample_data, unbiased=False)
y_biased = [N_distribution.pdf(x) for x in x_space]
N_distribution.ml(x_sample_data, unbiased=True)
y_unbiased = [N_distribution.pdf(x) for x in x_space]
plt.plot(x_space, y, color="g", label="True Gaussian distribution")
plt.plot(x_space, y_biased, color="r", ls=":", label="Biased maximum likelihood")
plt.plot(x_space, y_unbiased, color="b", ls=":", label="Unbiased maximum likelihood")
plt.xlabel("$x$", fontsize=14)
plt.legend(bbox_to_anchor=(1, 0.7), loc=2, borderaxespad=1, fontsize=14)
plt.show()
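The $(N-1)/N$ factor can also be demonstrated empirically: repeatedly drawing small samples from a known Gaussian and averaging the biased variance estimate should recover roughly $(N-1)/N\,\sigma^2$. The experiment below uses arbitrary illustrative settings ($N=5$, $\sigma^2=1$).
# Empirical check of the bias of the maximum likelihood variance estimate.
N_points, n_trials = 5, 10000
estimates = []
for _ in range(n_trials):
    sample = np.random.normal(loc=0.0, scale=1.0, size=N_points)
    estimates.append(np.mean((sample - np.mean(sample)) ** 2))   # biased sigma^2_ML
print(np.mean(estimates), (N_points - 1) / N_points)   # both should be close to 0.8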
Now, let's return to the curve fitting example and examine it from a probabilistic perspective, thereby gaining some insight into error functions and regularization, as well as taking a step towards a full Bayesian treatment. We shall assume that the target variable $t$ has a Gaussian distribution with mean equal to the value $y(x, \mathbf{w})$ of the polynomial curve, given by
$$p(t|x, \mathbf{w}, \beta) = \mathcal{N}\big(t \,|\, y(x, \mathbf{w}), \beta^{-1}\big)$$
where we have defined $\beta$ to be the precision parameter, corresponding to the inverse variance of the noise.
Each input point $x$ defines a Gaussian distribution over $t$ centered on the prediction $y(x, \mathbf{w})$. We plot a heatmap of this distribution over the values of $x$ and $t$.
x_sample, t_sample = generate_toy_data(sin, 50, 0.3)
feature = PolynomialFeature(4)
x_features = feature.transform(x_sample)
model = LinearRegression()
model.fit(x_features, t_sample)
x_space = np.arange(0, 1, 0.01)
t_space = np.arange(-2, 2, 0.01)
X, Y = np.meshgrid(x_space, t_space)
Z = np.zeros((x_space.size, t_space.size))
for i, x in enumerate(x_space):
# create the Gaussian distribution for a given input value
predicted_mu, _ = model.predict(feature.transform(x))
g = Gaussian(predicted_mu[0], 0.3)
for j, t in enumerate(t_space):
# compute the value of the Gaussian for each target value
Z[i, j] = g.pdf(t)
cp = plt.contourf(X, Y, Z.T.reshape(X.shape), cmap="rainbow", levels=40)
plt.scatter(x_sample, t_sample, color="snow")
plt.xlim(0, 1)
plt.ylim(-2, 2)
plt.xlabel("$x$", fontsize=14)
plt.ylabel("$t$", fontsize=14)
plt.colorbar(cp)
plt.show()
Then, we can use the training data x,t to determine the values of the unknown parameters w,β by maximum likelihood. Assuming the data are i.i.d, the likelihood function is given by
$$p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \,|\, y(x_n, \mathbf{w}), \beta^{-1}\big)$$
As with the Gaussian distribution earlier, we maximize the logarithm of the likelihood function, which takes the form
$$\ln p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\big(y(x_n, \mathbf{w}) - t_n\big)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi$$
To determine the maximum likelihood solution for the polynomial coefficients $\mathbf{w}_{ML}$, we can omit the terms that do not depend on $\mathbf{w}$. Moreover, since multiplying by the positive constant $\beta/2$ does not change the location of the maximum with respect to $\mathbf{w}$, it can be dropped. Therefore, the maximization is equivalent to minimizing the sum-of-squares error function, as presented in the polynomial curve fitting example. The sum-of-squares error function has thus arisen as a consequence of maximizing the likelihood under the assumption of a Gaussian noise distribution!
As a final step, we can also use maximum likelihood to determine the precision parameter β, as follows:
$$
\begin{aligned}
\frac{\partial}{\partial \beta}\ln p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta) = 0
&\Leftrightarrow \frac{\partial}{\partial \beta}\left[-\frac{\beta}{2}\sum_{n=1}^{N}\big(y(x_n, \mathbf{w}) - t_n\big)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi\right] = 0 \\
&\Leftrightarrow \frac{\partial}{\partial \beta}\left[-\frac{\beta}{2}\sum_{n=1}^{N}\big(y(x_n, \mathbf{w}) - t_n\big)^2\right] + \frac{\partial}{\partial \beta}\left[\frac{N}{2}\ln\beta\right] = 0 \\
&\Leftrightarrow -\frac{1}{2}\sum_{n=1}^{N}\big(y(x_n, \mathbf{w}) - t_n\big)^2 + \frac{N}{2}\frac{1}{\beta} = 0 \\
&\Leftrightarrow \sum_{n=1}^{N}\big(y(x_n, \mathbf{w}) - t_n\big)^2 = \frac{N}{\beta} \\
&\Leftrightarrow \frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\big(y(x_n, \mathbf{w}_{ML}) - t_n\big)^2
\end{aligned}
$$
Since we now have a probabilistic model, we can use the predictive distribution, which gives a probability distribution over $t$ rather than a simple point estimate,
$$p(t|x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\big(t \,|\, y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1}\big)$$
We can further introduce a prior distribution over the polynomial coefficients $\mathbf{w}$, in order to take a step towards a more Bayesian approach. Consider, for instance, a Gaussian prior of the form
$$p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{w}|\mathbf{0}, \alpha^{-1}\mathbf{I}) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left\{-\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}\right\}$$
where $\alpha$ is the precision of the multivariate Gaussian, and $M$ is the order of the polynomial, so that $M+1$ is the dimension of the parameter vector $\mathbf{w}$. Then, from Bayes' theorem, we have that
$$p(\mathbf{w}|\mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta)\,p(\mathbf{w}|\alpha)$$
Therefore, we can determine $\mathbf{w}$ by maximizing the posterior distribution. This technique is known as maximum a posteriori, or MAP, inference. The maximum of the posterior is given by the minimum of the negative logarithm, which can be shown to be
$$\frac{\beta}{2}\sum_{n=1}^{N}\big(y(x_n, \mathbf{w}) - t_n\big)^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$$
Thus, we see that maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function encountered in the regularization section, with $\lambda = \alpha/\beta$.
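This equivalence is easy to verify numerically: the MAP solution under the Gaussian prior coincides with the ridge solution for $\lambda = \alpha/\beta$. The sketch below uses arbitrary illustrative values for $\alpha$ and $\beta$ and plain NumPy rather than the prml models.
# MAP estimate vs. ridge solution with lambda = alpha / beta (illustrative alpha, beta).
alpha, beta = 5e-3, 11.1
M = 9
X = np.vander(x_train, M + 1, increasing=True)
w_map = np.linalg.solve(beta * X.T @ X + alpha * np.eye(M + 1), beta * X.T @ y_train)
w_ridge = np.linalg.solve(X.T @ X + (alpha / beta) * np.eye(M + 1), X.T @ y_train)
print(np.allclose(w_map, w_ridge))   # True: the two solutions coincide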
How much information is received when we observe a specific variable?
Consider, for instance, the event of seeing an alien spaceship appearing in the sky. We have never seen one, and we would be extremely surprised if one day such an event occurred, since it is unexpected given our current knowledge. Therefore, the amount of information can be viewed as the degree of surprise on learning the value of $x$. The measure of information content therefore depends on the probability distribution $p(x)$, and in particular is given by the negative logarithm of $p(x)$ as follows,
$$h(x) = -\log_2 p(x)$$
where the negative sign ensures that $h(x) \geq 0$. When the base of the logarithm is 2, the units of $h(x)$ are bits.
def h(prob: float) -> float:
    if not 0 <= prob <= 1:
        raise ValueError("Probability should be in [0,1].")
    elif prob == 0:
        # An impossible event carries infinite information content.
        return float("inf")
    else:
        return -math.log2(prob)
px_space = np.linspace(0, 1, 100)
y = [h(px) for px in px_space]
plt.plot(px_space, y)
plt.xlabel("Probability of event x", fontsize=14)
plt.ylabel("Information content (bits)", fontsize=14)
plt.show()
The average amount of information is obtained by taking the expectation over the information content, given by,
$$H[x] = \sum_x p(x)\,h(x) = -\sum_x p(x)\log_2 p(x)$$
This important quantity is called the entropy of the random variable $X$. Consider a discrete random variable $X$ having two possible values $x_1$ and $x_2$ with probabilities $p$ and $1-p$.
def entropy(probs):
    # Treat 0 * log(0) as 0 by skipping zero-probability outcomes.
    return sum(px * h(px) for px in probs if px > 0)
px_space = np.linspace(0, 1, 100)
y = [entropy([px, 1 - px]) for px in px_space]
plt.plot(px_space, y)
plt.xlabel("Probability of event $x_1$", fontsize=14)
plt.ylabel("Entropy (bits)", fontsize=14)
plt.show()
Note that the entropy is maximized when the two events are equiprobable, that is, for uniform distributions. The entropy is also defined in a similar way for continuous random variables, where it is called the differential entropy:
$$H[x] = -\int p(x)\ln p(x)\,dx$$
In the case of a joint distribution $p(x, y)$, the average additional information needed to specify $y$, given that the value of $x$ is already known, is called the conditional entropy and is given by
$$H[y|x] = -\iint p(x, y)\ln p(y|x)\,dx\,dy$$
Moreover, it is easily seen, using the product rule, that the joint entropy satisfies the relation
$$
\begin{aligned}
H[x, y] &= -\iint p(x, y)\ln p(x, y)\,dx\,dy \\
&= -\iint p(x, y)\ln\big(p(y|x)\,p(x)\big)\,dx\,dy \\
&= -\iint p(x, y)\big(\ln p(y|x) + \ln p(x)\big)\,dx\,dy \\
&= -\iint p(x, y)\ln p(y|x)\,dx\,dy - \iint p(x, y)\ln p(x)\,dx\,dy \\
&= -\iint p(x, y)\ln p(y|x)\,dx\,dy - \int\left(\int p(x, y)\,dy\right)\ln p(x)\,dx \\
&= -\iint p(x, y)\ln p(y|x)\,dx\,dy - \int p(x)\ln p(x)\,dx \\
&= H[y|x] + H[x]
\end{aligned}
$$
stating that the information needed to describe $x$ and $y$ is given by the information needed to describe $x$ alone plus the additional information needed to specify $y$ given $x$.
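For a small discrete joint distribution (an arbitrary illustrative table, using natural logarithms), the chain rule above can be checked directly.
# Check H[x, y] = H[x] + H[y|x] for a small discrete joint distribution.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])                 # p(x_i, y_j), sums to 1
p_x = p_xy.sum(axis=1)                        # marginal p(x_i)
p_y_given_x = p_xy / p_x[:, None]             # conditional p(y_j | x_i)
H_xy = -np.sum(p_xy * np.log(p_xy))
H_x = -np.sum(p_x * np.log(p_x))
H_y_given_x = -np.sum(p_xy * np.log(p_y_given_x))
print(np.isclose(H_xy, H_x + H_y_given_x))    # True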
Consider some unknown distribution p(x) that we have modelled using an approximate distribution q(x). Then, the average additional information required to specify a value of x as a result of using q(x) instead of p(x) is given by
$$\text{KL}(p\|q) = -\int p(x)\ln q(x)\,dx - \left(-\int p(x)\ln p(x)\,dx\right) = -\int p(x)\big(\ln q(x) - \ln p(x)\big)\,dx = -\int p(x)\ln\left(\frac{q(x)}{p(x)}\right)dx$$
This is known as the relative entropy or Kullback-Leibler divergence between the distributions $p(x)$ and $q(x)$. Note that it is not a symmetric quantity, that is to say $\text{KL}(p\|q) \neq \text{KL}(q\|p)$. Moreover, the Kullback-Leibler divergence satisfies $\text{KL}(p\|q) \geq 0$, with equality if, and only if, $p(x) = q(x)$. Thus, we can interpret the Kullback-Leibler divergence as a measure of the dissimilarity of the two distributions.
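As a quick numerical illustration with two made-up discrete distributions, the divergence is non-negative, zero only for identical distributions, and asymmetric. The kl_divergence helper below is defined for this example only.
# Kullback-Leibler divergence between two discrete distributions (natural log, in nats).
def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p_dist = np.array([0.6, 0.3, 0.1])
q_dist = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p_dist, p_dist))                                  # 0: identical distributions
print(kl_divergence(p_dist, q_dist), kl_divergence(q_dist, p_dist))   # non-negative and asymmetric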
There is an important relationship between data compression and density estimation: the most efficient compression is achieved when we know the true distribution, while using any other distribution requires additional information on average.
Suppose we are trying to approximate the unknown distribution using some parametric distribution $q(x|\boldsymbol{\theta})$, governed by a set of adjustable parameters $\boldsymbol{\theta}$, for instance a multivariate Gaussian distribution. One way to determine $\boldsymbol{\theta}$ would be to minimize the Kullback-Leibler divergence between $p(x)$ and $q(x|\boldsymbol{\theta})$. This cannot be done exactly, since we do not know $p(x)$. However, if we have observed a finite set of training points $x_n$ drawn from $p(x)$, then the expectation with respect to $p(x)$ can be approximated by a finite sum over these points, as in the sample-based approximation of the expectation introduced earlier, so that
$$\text{KL}(p\|q) \approx \frac{1}{N}\sum_{n=1}^{N}\big(-\ln q(x_n|\boldsymbol{\theta}) + \ln p(x_n)\big)$$
Since the second term is independent of $\boldsymbol{\theta}$, minimizing this approximate Kullback-Leibler divergence is equivalent to minimizing the negative log likelihood of $q(x_n|\boldsymbol{\theta})$ evaluated on the training set, that is, to maximizing the likelihood function.
Consider the joint distribution $p(x, y)$ between two sets of variables $x$ and $y$. If the sets are independent, then $p(x, y) = p(x)\,p(y)$. If the variables are not independent, we can measure how close they are to being independent by considering the Kullback-Leibler divergence between the joint distribution and the product of their marginals, given by
$$I[x, y] = \text{KL}\big(p(x, y)\,\|\,p(x)p(y)\big) = -\iint p(x, y)\ln\left(\frac{p(x)\,p(y)}{p(x, y)}\right)dx\,dy$$
which is called the mutual information of the variables $x$ and $y$. The mutual information satisfies $I[x, y] \geq 0$, with equality if, and only if, $x$ and $y$ are independent. Moreover, using the sum and product rules of probability, we can prove that the mutual information is related to the conditional entropy by
$$
\begin{aligned}
I[x, y] &= -\iint p(x, y)\ln\left(\frac{p(x)\,p(y)}{p(x, y)}\right)dx\,dy \\
&= -\iint p(x, y)\ln\left(\frac{p(x)\,p(y)}{p(x|y)\,p(y)}\right)dx\,dy \\
&= -\iint p(x, y)\ln\left(\frac{p(x)}{p(x|y)}\right)dx\,dy \\
&= -\iint p(x, y)\big(\ln p(x) - \ln p(x|y)\big)\,dx\,dy \\
&= -\iint p(x, y)\ln p(x)\,dx\,dy + \iint p(x, y)\ln p(x|y)\,dx\,dy \\
&= -\int\left(\int p(x, y)\,dy\right)\ln p(x)\,dx + \iint p(x, y)\ln p(x|y)\,dx\,dy \\
&= -\int p(x)\ln p(x)\,dx + \iint p(x, y)\ln p(x|y)\,dx\,dy \\
&= H[x] - H[x|y] = H[y] - H[y|x]
\end{aligned}
$$
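Using the same kind of small discrete joint table as before (illustrative values, natural logarithms), we can confirm numerically that the mutual information equals $H[x] - H[x|y]$.
# Check I[x, y] = H[x] - H[x|y] for a small discrete joint distribution.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
mutual_info = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
p_x_given_y = p_xy / p_y[None, :]
H_x = -np.sum(p_x * np.log(p_x))
H_x_given_y = -np.sum(p_xy * np.log(p_x_given_y))
print(np.isclose(mutual_info, H_x - H_x_given_y))   # True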