$\renewcommand{\vec}[1]{\boldsymbol{#1}}$
Note: the algorithms described here can also be applied to transformed input, $\phi(\vec{x})$, to determine $p\left(C_k\right|\left.\phi(\vec{x})\right)$, where $\phi(\cdot)$ may be a nonlinear transformation of the input space.
Our first model assumption is that our target output can be modeled as
$$y(\phi(\vec{x})) = f(\vec{w}^\intercal\phi(\vec{x}))$$where $y$ will be a vector of probabilities with $K$ elements.
The target variable, $y$, does not provide a decision for a class assignment for a given input $\vec{x}$.
There are a variety of decision models, all of which leverage the class posterior probability models, $p\left(C_k\right|\left.\vec{x}\right)$, such as
The models discussed today are called linear because,
$\vec{w}^\intercal\phi(\vec{x})=\text{constant}$, and thus represent a linear function of $\phi(\vec{x})$.
Our first step is to form a model of the class posterior probabilities for the inference stage:
This can be done by modeling:
Note: It can also be done directly, as in probabilistic discriminative modeling (forward pointer).
Bayes' theorem describes the probability of an event, based on conditions that might be related to the event.
$$ P\left(A\right|\left.B\right) = \frac{P\left(B\right|\left.A\right) \, P(A)}{P(B)}\,,$$where $A$ and $B$ are events.
In this case, for $K$ classes, our activation function will take the form of the softmax function,
$$ f_k = p\left(C_k\right|\left.\vec{x}\right) = \frac{p\left(\vec{x}\right|\left.C_k\right)p\left(C_k\right)}{p(\vec{x})} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}\,, $$where
$$a_j = \ln\left(p\left(\vec{x}\right|\left.C_k\right)p(C_k)\right).$$In the case of $K=2$, the activation function will reduce to the logistic sigmoid function
$$f(\vec{x}) = p\left(C_1\right|\left.\vec{x}\right) = \frac{1}{1+\exp(-a)} = \sigma(a)\,,$$where
$$a = \ln \frac {p\left(\vec{x}\right|\left. C_1\right)p(C_1)} {p\left(\vec{x}\right|\left. C_2\right)p\left(C_2\right)}\,.$$These models are known as generative models because they can be used to generate synthetic input data by applying inverse transform sampling to the marginal distribution for $\vec{x}$:
$$p(\vec{x}) = \sum_k p(\vec{x}|C_k)p(C_k)\,.$$To move forward, it is necessary to start making model assumptions. Here we will assume that:
Under the Gaussian assumption, the class-conditional density for class $C_k$ is
$$p(\vec{x}|C_k) = \frac{1}{\left(2 \pi \right)^{n/2}} \frac{1}{\left|\vec{\Sigma}\right|^{1/2}} \exp \left( -\frac{1}{2} \left(\vec{x} - \vec{\mu}_k\right)^\intercal \vec{\Sigma}^{-1} \left(\vec{x} - \vec{\mu}_k\right) \right)\,,$$where:
Note: here we have assumed that all classes share the same covariance matrix.
A logistic function or logistic curve is a common "S" shape (sigmoid curve), defined as
$$\sigma(x)={\frac {L}{1+\exp(-k(x-x_{0}))}}\,,$$where:
In the case of two classes, this result is substituted into the logistic sigmoid function and reduces to
$$p(C_1|\vec{x}) = \sigma \left( \vec{w}^\intercal \vec{x} + w_0 \right)$$were we have defined:
$$\vec{w} = \mathbf{\Sigma}^{-1} \left( \mathbf{\mu}_1 - \mathbf{\mu}_2 \right),$$$$w_0 = -\frac{1}{2} \vec{\mu}_1^\intercal \vec{\Sigma}^{-1} \vec{\mu}_1 + \frac{1}{2} \vec{\mu}_2^\intercal \vec{\Sigma}^{-1} \vec{\mu}_2 + \ln \frac{p(C_1)} {p(C_2)}$$If we are to use the result above, we will need to make another model assumption.
These results can be extended to the case of $K>2$ classes for which we obtain
$$a_k(\vec{x}) = \left[\vec{\Sigma}^{-1} \vec{\mu}_k \right]^\intercal \vec{x} - \frac{1}{2} \vec{\mu}_k^\intercal \vec{\Sigma}^{-1} \vec{\mu}_k + \ln p(C_k)$$which is used in the first activation function provided at the begin of this section.
We have not formulated a complete model for the posterior class densities, in that we have not yet solved for the model parameters, $\vec{\mu}$ and $\vec{\Sigma}$.
We do that now using a maximum likelihood approach.
The likelihood of a set of parameter values, $\theta$, given outcomes $x$, is the probability of those observed outcomes given those parameter values,
$$\mathcal{L}(\theta |x)=P(x|\theta)$$Considering the case of two classes, $C_1$ and $C_2$, with
where $t_i=0$ indicates that $\vec{x_i}$ is in class $C_1$ and $t_i=1$ indicates that $\vec{x}_i$ is in class $C_2$.
The likelihood function is then given by
$$p\left(\vec{t}, \vec{X} \mid \gamma, \vec{\mu}_1, \vec{\mu}_2, \vec{\Sigma}\right) = \prod_{i=1}^m \left[\gamma \mathcal{N} \left(\vec{x}_i \mid \mu_1, \vec{\Sigma}\right)\right]^{t_i} \left[\left(1-\gamma\right) \mathcal{N}\left(\vec{x}_i \mid \mu_2, \vec{\Sigma}\right)\right]^{1-t_i}\,.$$Taking the derivate of this expression with respect to the various model parameters, $\gamma$, $\mu_1$, $\mu_2$, and $\vec{\Sigma}$, and setting it equal to zero, we obtain
$$\gamma = \frac{N_1}{N_1+N_2}$$where $N_1$ is the number of training inputs in class $C_1$ and $N_2$ is the number in class $C_2$.
$CDF(\vec{X}) = \int_{-\infty}^{\vec{X}} p(\vec{x})d\vec{x}$ (note that the range of the CDF is $[0,1]$).
We will assume that the correct class is that for which the posterior probability $p(\left.C_k\right|\vec{x})$ is greatest, unless the difference between the two posterior probabilities is less than some minimum value in which case we will chose randomly - this will add some "noise" to our input training data.
Select truth data values, these will NOT be known to the training algorithm, they are only used in generating the sample data.
true_mu1 =-2.0
true_mu2 = 2.0
true_sigma = 2.0
true_gamma = 0.5
import scipy.integrate as sci_intgr
import scipy.optimize as sci_opt
Defining probability functions
def p_xCk(x, mu, sigma):
'Class conditional probability p(x|Ck)'
denom = math.sqrt(2.0 * math.pi * sigma)
arg = -0.5 * (x - mu) * (x - mu) / sigma
return math.exp(arg) / denom
def p_x(x, mu1, mu2, sigma, gamma):
'Marginal probability p(x)'
return gamma * p_xCk(x, mu1, sigma) + (1.0 - gamma) * p_xCk(x, mu2, sigma)
def p_Ckx(x, mu1, mu2, sigma, gamma):
'Posterior class probability vector (p(C_1|x), p(C_2|x))'
a = math.log(p_xCk(x, mu1, sigma)*gamma/(p_xCk(x,mu2,sigma)*(1-gamma)))
pc1 = 1.0/(1.0 + math.exp(-a))
return (pc1, 1.0 - pc1)
def cdf(x, mu1, mu2, sigma, gamma):
'Cumulative distribution function P(x<X)'
return sci_intgr.quad(func=p_x, a=-np.inf, b=x, args=(mu1, mu2, sigma, gamma))
def inv_cdf(y, mu1, mu2, sigma, gamma):
'Inverse of the CDF'
f = lambda x: cdf(x,mu1,mu2,sigma,gamma)[0] - y
return sci_opt.newton(f, 0)
Class conditional probabilities
domain = np.linspace(-5, 5, 100)
px_class1 = [p_xCk(x, true_mu1, true_sigma) for x in domain]
px_class2 = [p_xCk(x, true_mu2, true_sigma) for x in domain]
fig = plt.figure(figsize=(5,4))
plt.plot(domain, px_class1, label='Class 1')
plt.plot(domain, px_class2, label='Class 2')
plt.xlabel('$x$'); plt.ylabel('$p(x|C_k)$');plt.legend(bbox_to_anchor=(1.37,1), fancybox=True);
plt.title('Class conditional probability - $p(x|C_k)$');
Marginal distribution of $x$
px = [p_x(x, true_mu1, true_mu2, true_sigma, true_gamma) for x in domain]
fig = plt.figure(figsize=(5,4))
plt.plot(domain,px)
plt.xlabel('$x$');plt.ylabel('$p(x)$');plt.title('Marginal distribution - $p(x)$');
Posterior distribution
pc1x, pc2x = [], []
for x in domain:
pck = p_Ckx(x, true_mu1, true_mu2, true_sigma, true_gamma)
pc1x.append(pck[0]); pc2x.append(pck[1])
fig = plt.figure(figsize=(5,4))
plt.plot(domain, pc1x, label='Class 1')
plt.plot(domain, pc2x, label='Class 2')
plt.xlabel('$x$');plt.ylabel('$p(C_k|x)$');
plt.title('Posterior distributions - $p(C_1|x)$ and $p(C_2|x)$');
Or model is ready, let's generate some data and test it.
np.random.seed(123456789) # do not do this in real life
num_samples = 1000
x = np.zeros(num_samples)
t = np.zeros(num_samples)
pcx = np.zeros(num_samples)
n1 = 0
nae = 0
assignment_epsilon = 0.5
Assigning x
to 1
for class 1 and 0
for class 2.
for i in range(num_samples):
rv = np.random.uniform()
x[i] = inv_cdf(rv, true_mu1, true_mu2, true_sigma, true_gamma)
pcx1 = p_Ckx(x[i], true_mu1, true_mu2, true_sigma, true_gamma)
pcx2 = pcx1[1]
pcx1 = pcx1[0]
#we don't want a perfect dividing line for our domain, otherwise why would we need a learning algorithim?
if math.fabs(pcx2-pcx1) <= assignment_epsilon:
nae = nae + 1
if np.random.uniform() <= 0.5:
t[i] = 1
n1 = n1 + 1
else:
t[i] = 0
elif pcx1 > pcx2:
t[i] = 1
n1 = n1 + 1
else: t[i]=0
fig = plt.figure(figsize=(5,4))
plt.scatter(x, t, marker='.', alpha=0.1)
plt.axvline(0, linestyle='--')
plt.xlabel('$x$');plt.ylabel('Class ($t$)'); plt.title('Simulated training data');
n2 = num_samples - n1
e_gamma = n1/num_samples
e_mu1 = 0
for x_i, t_i in zip(x,t):
if t_i == 1:
e_mu1 += x_i
# e_mu1 += x_i*t_i
e_mu1 /= n1
e_mu2 = 0
for x_i, t_i in zip(x,t):
if t_i == 0:
e_mu2 += x_i
# e_mu2 += x_i*(1-t_i)
e_mu2 /= n2
def Sn(Z, ck, mu=(e_mu1,e_mu2)):
idx = 1-int(ck)
return (Z[0]-mu[idx]) * (Z[0]-mu[idx])
e_sigma = 0
for x_i, t_i in zip(x,t):
e_sigma += Sn((x_i, t_i), t_i)
e_sigma /= float(num_samples)
print('The number of inputs in C1 is {0}'.format(n1))
print('The number of inputs that were assigned based on the assignment factor is {0}'.format(nae))
print('True model parameters: gamma={0}, µ1={1}, µ2={2}, ∑={3}'.format(true_gamma,
true_mu1, true_mu2,
true_sigma))
print('Estimated model parameters: gamma={0}, µ1={1}, µ2={2}, ∑={3}'.format(e_gamma, e_mu1,
e_mu2, e_sigma))
The number of inputs in C1 is 505 The number of inputs that were assigned based on the assignment factor is 121 True model parameters: gamma=0.5, µ1=-2.0, µ2=2.0, ∑=2.0 Estimated model parameters: gamma=0.505, µ1=-2.069287615754436, µ2=2.095874429059065, ∑=1.7242111261175646
Probablistic discriminative models assume the same generalized linear model
$$y(\phi(\vec{x})) = f(\vec{w}^\intercal\phi(\vec{x}))$$as in the probabilistic generative models.
The key difference is the form of the likelihood function.
$p\left(\phi(\vec{x}),C_k\right)=p\left(\phi(\vec{x})\right|\left.C_k\right)p(C_k)$.
with
$$p\left(C_2\right|\left.\phi(\vec{x})\right) = 1 - p\left(C_1\right|\left.\phi(\vec{x})\right).$$and $i=1,\ldots, m$.
Defining the error function, $E(\vec{w})$, as the negative of the log-likelihood function, and taking the gradient with respect to $\vec{w}$, we obtain $$\bigtriangledown E(\vec{w}) = \sum_{i=1}^m \left[\sigma(\vec{w}^\intercal \phi(\vec{x}_i)) - t_i\right] \phi(\vec{x}_i).$$
Here we will consider the Newton-Raphson method for which minimizing the error function takes the form
$$\vec{w}^{(\tau+1)} = \vec{w}^{(\tau)} - \vec{H}^{-1}\bigtriangledown E(\vec{w}),$$where $\vec{H}$ is the Hessian matrix composed of the second derivatives of the error function
$$\vec{H} = \bigtriangledown \bigtriangledown E(\vec{w}) = \Phi^\intercal \vec{R} \Phi,$$where $\Phi$ is the $n \times m$ design matrix whose $n$-th row is given by $\phi(\vec{x_n})^\intercal$ and $\vec{R}$ is an $n \times n$ diagonal matrix with elements.
$$R_{i,i} = \sigma(\vec{w}^\intercal \phi(\vec{x}_i)) \left[1-\sigma(\vec{w}^\intercal \phi(\vec{x}_i))\right].$$This can be reduced to a form equivalent to that of locally weighted linear regression as follows
$$\vec{w}^{(\tau+1)} = \left( \Phi^T \vec{R} \Phi \right)^{-1} \Phi^T \vec{R} \vec{z}$$where $\vec{z}$ is an $n$-dimensional vector defined by
$$\vec{z} = \Phi \vec{w}^{(\tau)} - \vec{R}^{-1}(\vec{y} - \vec{t})$$# preparing training dataset
np.random.seed(123456789) # fixing for reproducubility
N = 100 #number of data points
D = 2 #dimension of input vector
t = np.zeros(N) #training set classifications
X = np.zeros((N,D)) #training data in input space
sigma = .25
mu0 = 0.0
mu1 = 1.0
for i in range(N):
#choose class to sample for
fac = 1
if np.random.rand() <= 0.5:
thismu = mu0
t[i] = 1
else:
thismu = mu1
t[i] = 0
if np.random.rand() < 0.5: fac = -1
X[i,0] = fac * np.random.normal(thismu, sigma)
X[i,1] = fac * np.random.normal(thismu, sigma)
fig = plt.figure(figsize=(5, 5))
ax = fig.gca()
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.set_title('Training data set')
create_scatter(X, t, ax)
The training data does not have a linear boundary in the original input space.
phi(...)
) to try to make it linearly separabledef phi(x, mu, sigma):
detSigma = np.linalg.det(sigma)
fac = math.pow(2 * math.pi, len(mu) / 2.0) * math.sqrt(detSigma)
arg = -0.5 * np.dot((x - mu).T, np.dot(np.linalg.inv(sigma), x - mu))
return math.exp(arg) / fac
phiX = np.zeros((N,D))
MU1 = np.ones(D)*mu0
MU2 = np.ones(D)*mu1
SIGMA = np.diag(np.ones(D))*sigma
for i in range(N):
phiX[i,0] = phi(x=X[i,:], mu=MU2, sigma=SIGMA)
phiX[i,1] = phi(x=X[i,:], mu=MU1, sigma=SIGMA)
fig = plt.figure(figsize=(5,5)); ax =fig.gca()
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
create_scatter(phiX,t,ax)
Now, lets apply machine learning to determine the boundary!
phi_n = [1, phiX[0], phiX[1]]
.M = 3
Phi = np.ones((N,M))
Phi[:,1] = phiX[:,0]
Phi[:,2] = phiX[:,1]
w = np.zeros(M)
R = np.zeros((N,N))
y = np.zeros(N)
def sigmoid(a):
return 1.0 / (1.0 + math.exp(-a))
def totalErr(y,t):
e = 0.0
for i in range(len(y)):
if t[i] > 0:
e += math.log(y[i])
else:
e += math.log(1.0 - y[i])
return -e
Starting Newton-Raphson.
max_its = 100
tol = 1e-2
w0 = [w[0]]
w1 = [w[1]]
w2 = [w[2]]
err = []
error_delta = 1 + tol
current_error = 0
idx = 0
from functools import reduce
while math.fabs(error_delta) > tol and idx < max_its:
#update y & R
for i in range(N):
y[i] = sigmoid(reduce(lambda accum, Z: accum + Z[0]*Z[1], zip(w, Phi[i,:]), 0))
R[i,i] = y[i] - y[i]*y[i]
#update w
z = np.dot(Phi,w) - np.dot(np.linalg.pinv(R),y-t)
term_1 = np.linalg.pinv(np.dot(np.dot(Phi.T,R),Phi))
term_2 = np.dot(np.dot(term_1, Phi.T),R)
w = np.dot(term_2, z)
w0.append(w[0])
w1.append(w[1])
w2.append(w[2])
idx += 1
temp = totalErr(y,t)
error_delta = current_error - temp
current_error = temp
err.append(error_delta)
print('The total number of iterations was {0}'.format(idx))
print('The total error was {0}'.format(current_error))
print('The final change in error was {0}'.format(error_delta))
print('The final parameters were {0}'.format(w))
The total number of iterations was 13 The total error was 0.003482221426895694 The final change in error was 0.00577180168397616 The final parameters were [ -17.844 -14.062 163.065]
Our decision boundary is now formed by the line where $\sigma(a) = 0.5$, i.e. where $a = 0$, which for this example is where $\phi_2 = -\frac{w_1}{w_2}\phi_1$, i.e. where $\vec{w} \cdot\vec{\phi} = 0.5$.
fig = plt.figure(figsize=(5,5)); ax =fig.gca()
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
create_scatter(phiX,t,ax)
bdryx = (-0.2,1.1)
bdryy = (-(w[0]+w[1]*bdryx[0])/w[2], -(w[0]+w[1]*bdryx[1])/w[2])
ax.plot(bdryx, bdryy, linestyle='--');
fig = plt.figure(figsize=(5,5))
plt.plot(w0, 'bo-', label='$w_0$')
plt.plot(w1, 'rx-', label='$w_1$')
plt.plot(w2, 'gs-', label='$w_2$')
plt.plot(err, 'm*-', label='error delta')
plt.legend(loc='upper left', frameon=True)
plt.xlabel('Newton-Raphson iterations');