# Discriminative Classification¶

### Challenge: difficult class-conditional data distributions¶

Our task will be the same as in the preceding class on (generative) classification. But this time, the class-conditional data distributions look very non-Gaussian, yet the linear discriminative boundary looks easy enough:

In [1]:
using Pkg;Pkg.activate("probprog/workspace");Pkg.instantiate()
using Random; Random.seed!(1234);
IJulia.clear_output();

In [2]:
# Generate dataset {(x1,y1),...,(xN,yN)}
# x is a 2-d feature vector [x_1;x_2]
# y ∈ {false,true} is a binary class label
# p(x|y) is multi-modal (mixture of uniform and Gaussian distributions)
using PyPlot
include("./scripts/lesson8_helpers.jl")
N = 200
X, y = genDataset(N) # Generate data set, collect in matrix X and vector y
X_c1 = X[:,findall(.!y)]'; X_c2 = X[:,findall(y)]' # Split X based on class label
X_test = [3.75; 1.0] # Features of 'new' data point
function plotDataSet()
plot(X_c1[:,1], X_c1[:,2], "bx", markersize=8)
plot(X_c2[:,1], X_c2[:,2], "r+", markersize=8, fillstyle="none")
plot(X_test[1], X_test[2], "ko")
xlabel(L"x_1"); ylabel(L"x_2");
legend([L"y=0", L"y=1",L"y=?"], loc=2)
xlim([-2;10]); ylim([-4, 8])
end
plotDataSet();


### Main Idea of Discriminative Classification¶

• Again, a data set is given by $D = \{(x_1,y_1),\dotsc,(x_N,y_N)\}$ with $x_n \in \mathbb{R}^M$ and $y_n \in \mathcal{C}_k$, with $k=1,\ldots,K$.
• Sometimes, the precise assumptions of the (multinomial-Gaussian) generative model $$p(x_n,y_n\in\mathcal{C}_k|\theta) = \pi_k \cdot \mathcal{N}(x_n|\mu_k,\Sigma)$$ clearly do not match the data distribution.
• Here's an IDEA! Let's model the posterior $$p(y_n\in\mathcal{C}_k|x_n)$$ directly, without any assumptions on the class densities.
• Of course, this implies also that we build direct models for the discrimination boundaries, which are given by $$\frac{p(y_n\in\mathcal{C}_k|x_n)}{p(y_n\in\mathcal{C}_j|x_n)} \overset{!}{=} 1$$

### Bayesian Logistic Regression¶

• We will work this idea out for a 2-class problem. Assume a data set is given by $D = \{(x_1,y_1),\dotsc,(x_N,y_N)\}$ with $x_n \in \mathbb{R}^M$ and $y_n \in \{0,1\}$.
##### Model Specification¶
• What model should we use for the posterior distribution $p(y_n \in \mathcal{C}_k|x_n)$?
• In Logistic Regression, we take inspiration from the generative approach, where the softmax function "emerged" as the posterior. Here, we choose the 2-class softmax function (which is called the logistic function) with linear discrimination bounderies for the posterior class probability: $$p(y_n =1 \,|\, x_n, w) = \sigma(w^T x_n) \,.$$ where $$\sigma(a) = \frac{1}{1+e^{-a}}$$ is the logistic function.

• (Bishop fig.4.9). The logistic function $\sigma(a) = 1/(1+e^{-a})$ (red), together with the scaled probit function $\Phi(\lambda a)$, for $\lambda^2=\pi/8$ (in blue). We will use this approximation later in the Laplace approximation.
• Indeed, for this choice of posterior class probabilities, the discrimination boundary is a straight line, see Exercises.
• Adding the other class ($y_n=0$) leads to the following posterior class distribution: \begin{align*} p(y_n \,|\, x_n, w) &= \mathrm{Bernoulli}\left(y_n \,|\, \sigma(w^T x_n) \right) \\ &= \sigma(w^T x_n)^{y_n} \left(1 - \sigma(w^T x_n)\right)^{(1-y_n)} \tag{B-4.89} \\ &= \sigma\left( (2y_n-1) w^T x_n\right) \end{align*}
• Note that for the 3rd equality, we have made use of the fact that $\sigma(-a) = 1-\sigma(a)$.
• Each of these three models in B-4.89 are equivalent. We mention all three notational options since they all appear in the literature.
• This choice for the class posterior is called logistic regression, in analogy to linear regression (where $p(y_n|x_n,w) = \mathcal{N}(y_n|w^T x_n,\sigma^2)$).
• Note that in this model specification, we do not impose a Gaussian structure on the class features. In the discriminative approach, the parameters $w$ are not structured into $\{\mu,\Sigma,\pi \}$. This provides discriminative approach with more flexibility than the generative approach.
• In Bayesian logistic regression, we add a Gaussian prior on the weights: \begin{align*} p(w) = \mathcal{N}(w \,|\, m_0, S_0) \tag{B-4.140} \end{align*}
##### Inference¶
• The posterior for the weights follows by Bayes rule \begin{align*} \underbrace{p(w \,|\, D)}_{\text{posterior}} &\propto p(w) p(D|w) \\ &= \underbrace{\mathcal{N}(w \,|\, m_0, S_0)}_{\text{prior}} \cdot \underbrace{\prod_{n=1}^N \sigma\left( (2y_n-1) w^T x_n\right)}_{\text{likelihood}} \tag{B-4.142} \end{align*}

• In principle, Bayesian inference is done now. Unfortunately, the posterior is not Gaussian and the evidence $p(D)$ is also not analytically computable. (We will deal with this later).

##### Predictive distribution¶
• For a new data point $x_\bullet$, the predictive distribution for $y_\bullet$ is given by \begin{align*} p(y_\bullet = 1 \mid x_\bullet, D) &= \int p(y_\bullet = 1 \,|\, x_\bullet, w) \, p(w\,|\, D) \,\mathrm{d}w \\ &= \int \sigma(w^T x_\bullet) \, p(w\,|\, D) \,\mathrm{d}w \tag{B-4.145} \end{align*}
• After substitution of $p(w | D)$ from B-4.142, we have an integral that is not solvable in closed-form.
• Many methods have been developed to approximate the integrals for the predictive distribution and evidence. Here, we present the Laplace approximation, which is one of the simplest methods with broad applicability to Bayesian calculations.

### The Laplace Approximation¶

• The central idea of the Laplace approximation is to approximate a (possibly unnormalized) distribution $f(z)$ by a Gaussian distribution $q(z)$.
• Note that $\log q(z)$ is a second-order polynomial in $z$, so we will find the Gaussian by fitting a parabola to $\log f(z)$.
##### Example¶

• (Bishop fig.4.14a). Laplace approximation (in red) to the distribution $p(z)\propto \exp(-z^2/2)\sigma(20z+4)$, where $\sigma(a)=1/(1+e^{-a})$. The Laplace approximation is centered on the mode of $p(z)$.

### Working out the Laplace Approximation¶

##### estimation of mean¶
• The mean ($z_0$) of $q(z)$ is placed on the mode of $\log f(z)$, i.e.,
$$z_0 = \arg\max_z \log f(z) \tag{B-4.126}$$

##### estimation of precision matrix¶
• Since the gradient $\nabla \left. f(z) \right|_{z=z_0}$ vanishes at the mode, we can (Taylor) expand $\log f(z)$ around $z=z_0$ as \begin{align*} \log f(z) &\approx \log f(z_0) + \overbrace{\left(\nabla \log f(z_0)\right)^T (z-z_0)}^{=0 \text{ at }z=z_0} + \ldots \\ &\qquad + \frac{1}{2} (z-z_0)^T \left(\nabla \nabla \log f(z_0)\right) (z-z_0) \\ &= \log f(z_0) - \frac{1}{2} (z-z_0)^T A (z-z_0) \tag{B-4.131} \end{align*} where the Hessian matrix $A$ is defined by $$A = - \nabla \nabla \left. \log f(z) \right|_{z=z_0} \tag{B-4.132}$$
##### Laplace approximation construction¶
• After taking exponentials in eq. B-4.131, we obtain
$$f(z) \approx f(z_0) \exp\left( - \frac{1}{2} (z-z_0)^T A (z-z_0)\right)$$
• We can now identify $q(z)$ as $$q(z) = \mathcal{N}\left( z\,|\,z_0, A^{-1}\right) \tag{B-4.134}$$ with $z_0$ and $A$ defined by eqs. B-4.126 and B-4.132.

### Bayesian Logistic Regression with the Laplace Approximation¶

• Let's get back to the challenge of computing the predictive class distribution (B-4.145) for Bayesian logistic regression. We first work out the Gaussian Laplace approximation $q(w)$ to the posterior weight distribution \begin{align*} \underbrace{p(w | D)}_{\text{posterior}} \propto \underbrace{\mathcal{N}(w \,|\, m_0, S_0)}_{\text{prior}} \cdot \underbrace{\prod_{n=1}^N \sigma\left( (2y_n-1) w^T x_n\right)}_{\text{likelihood}} \tag{B-4.142} \end{align*}
##### A Gausian Laplace approximation to the weights posterior¶
• Since we have a differentiable expression for $\log p(w | D)$, it is straightforward to compute the gradient and Hessian: \begin{align*} \nabla_w \log p(w | D) &= S_0^{-1}\cdot \left(m_0-w\right) + \sum_n (2y_n-1) (1-\sigma_n) x_n \\ \nabla_w \nabla_w \log p(w | D) &= -S_0^{-1} - \sum_n \sigma_n (1-\sigma_n) x_n x_n^T \tag{B-4.143} \end{align*} where we used shorthand $\sigma_n$ for $\sigma\left( (2y_n-1) w^T x_n\right)$.
• We can now use the gradient $\nabla_w \log p(w | D)$ to find the mode $w_{N}$ of $\log p(w|D)$ (eg by some gradient-based optimization procedure) and then use the Hessian $\nabla_w \nabla_w \log p(w | D)$ to get the variance of $q(w)$, leading to a **Gaussian approximate weights posterior**: $$q(w) = \mathcal{N}\left(w\,|\, w_{N}, S_N\right) \tag{B-4.144}$$ with $$S_N^{-1} = S_0^{-1} + \sum_n \sigma_n (1-\sigma_n) x_n x_n^T \tag{B-4.143}$$

### Using the Laplace-approximated parameter posterior to evaluate the predictive distribution¶

• In the analytically unsolveable expressions for evidence and the predictive distribution (estimating the class of a new observation), we proceed with using the Laplace approximation to the weights posterior. For a new observation $x_\bullet$, the class probability is now \begin{align*} p(y_\bullet = 1 \mid x_\bullet, D) &= \int p(y_\bullet = 1 \,|\, x_\bullet, w) \cdot p(w\,|\, D) \,\mathrm{d}w \\ &\approx \int p(y_\bullet = 1 \,|\, x_\bullet, w) \cdot \underbrace{q(w)}_{\text{Gaussian}} \,\mathrm{d}w \\ &= \int \sigma(w^T x_\bullet) \cdot \mathcal{N}\left(w \,|\, w_N, S_N\right) \,\mathrm{d}w \tag{B-4.145} \end{align*}
• This looks better but we need two more clever tricks to evaluate this expression.
1. First, note that $w$ appears in $\sigma(w^T x_\bullet)$ as an inner product, so through substitution of $a:=w^T x_\bullet$, the expression simplifies to an integral over the scalar $a$ (see Bishop for derivation): \begin{align*} p(y_\bullet = 1 \mid x_\bullet, D) &\approx \int \sigma(a) \, \mathcal{N}\left(a\,|\, \mu_a, \sigma_a^2\right) \,\mathrm{d}a \tag{B-4.151}\\ \mu_a &= w^T_{N} x_\bullet \tag{B-4.149}\\ \sigma_a^2 &= x^T_\bullet S_N x_\bullet \tag{B-4.150} \end{align*}
2. Secondly, while the integral of the product of a logistic function with a Gaussian is not analytically solvable, the integral of the product of a Gaussian cumulative distribution function (CDF, also known as the probit function) with a Gaussian does have a closed-form solution. Fortunately, $$\Phi(\lambda a) \approx \sigma(a)$$ with the Gaussian CDF $\Phi(x)= \frac{1}{\sqrt(2\pi)}\int_{-\infty}^{x}e^{-t^2/2}\mathrm{d}t$, $\lambda^2= \pi / 8$ and $\sigma(a) = 1/(1+e^{-a})$. Thus, substituting $\Phi(\lambda a)$ with $\lambda^2= \pi / 8$ for $\sigma(a)$ leads to
\begin{align*} p(y_\bullet = 1 \mid x_\bullet, D) &= \int \sigma(w^T x_\bullet) \cdot p(w|D) \,\mathrm{d}w \\ &\approx \int \underbrace{\Phi(\lambda a)}_{\text{probit function}} \cdot \underbrace{\mathcal{N}\left(a\,|\, \mu_a, \sigma_a^2\right)}_{\text{Gaussian}} \,\mathrm{d}a \\ &= \Phi\left( \frac{\mu_a}{\sqrt(\lambda^{-2} +\sigma_a^2)}\right) \tag{B-4.152} \end{align*}
• We now have an approximate but closed-form expression for the predictive class distribution for a new observation with a Bayesian logistic regression model.
• Note that, by Eq.B-4.143, the variance $S_N$ (and consequently $\sigma_a^2$) for the weight vector depends on the distribution of the training set. Large uncertainty about the weights (in areas with little training data and uninformative prior variance $S_0$) takes the posterior class probability eq. B-4.152 closer to $0.5$. Does that make sense?
• Apparently, the Laplace approximation leads to a closed-form solutions for Bayesian logistic regression (although admittedly, the derivation is no walk in the park).

### ML Estimation for Discriminative Classification¶

• Rather than the computationally involved Bayesian method, in practice, discriminative classification is often executed through maximum likelihood estimation.
• With the usual 1-of-K encoding scheme for classes ($y_{nk}=1$ if $x_n \in \mathcal{C}_k$, otherwise $y_{nk}=0$), the log-likelihood for a $K$-dimensional discriminative classifier is

\begin{align*} \mathrm{L}(\theta) &= \log \prod_n \prod_k {p(\mathcal{C}_k|x_n,\theta)}^{y_{nk}} \\ &= \log \prod_n \prod_k \Bigg(\underbrace{\frac{e^{\theta_k^T x_n}}{ \sum_j e^{\theta_j^T x_n}}}_{\text{softmax function}}\Bigg)^{y_{nk}} \\ &= \sum_n \sum_k y_{kn} \log \big( \frac{e^{\theta_k^T x_n}}{ \sum_j e^{\theta_j^T x_n}} \big) \end{align*}

$$\nabla_{\theta_k} \mathrm{L}(\theta) = \sum_n \underbrace{\big( \underbrace{y_{nk}}_{\text{target}} - \underbrace{\frac{e^{\theta_k^T x_n}}{ \sum_j e^{\theta_j^T x_n}}}_{\text{prediction}} \big)}_{\text{prediction error}}\cdot x_n$$
$$\nabla_\theta \mathrm{L}(\theta) = \sum_n \left(y_n - \theta^T x_n \right) x_n$$
• In both cases
$$\nabla_\theta \mathrm{L} = \sum_n \left( \text{target}_n - \text{prediction}_n \right) \cdot \text{input}_n$$
• The parameter vector $\theta$ for logistic regression can be estimated through iterative gradient-based adaptation. E.g. (with iteration index $i$), $$\hat{\theta}^{(i+1)} = \hat{\theta}^{(i)} + \eta \cdot \left. \nabla_\theta \mathrm{L}(\theta) \right|_{\theta = \hat{\theta}^{(i)}}$$
• Note that, while in the Bayesian approach we get to update $\theta$ with precision-weighted prediction errors) (which is optimal), in the maximum likelihood approach, we weigh the prediction errors with input values (which is less precise).

#### CODE EXAMPLE¶

Let us perform ML estimation of $w$ on the data set from the introduction. To allow an offset in the discrimination boundary, we add a constant 1 to the feature vector $x$. We only have to specify the (negative) log-likelihood and the gradient w.r.t. $w$. Then, we use an off-the-shelf optimisation library to minimize the negative log-likelihood.

We plot the resulting maximum likelihood discrimination boundary. For comparison we also plot the ML discrimination boundary obtained from the code example in the generative Gaussian classifier lesson.

In [3]:
using Optim # Optimization library

y_1 = zeros(length(y))# class 1 indicator vector
y_1[findall(y)] .= 1
X_ext = vcat(X, ones(1, length(y))) # Extend X with a row of ones to allow an offset in the discrimination boundary

# Implement negative log-likelihood function
function negative_log_likelihood(θ::Vector)
# Return negative log-likelihood: -L(θ)
p_1 = 1.0 ./ (1.0 .+ exp.(-X_ext' * θ))   # P(C1|X,θ)
return -sum(log.( (y_1 .* p_1) + ((1 .- y_1).*(1 .- p_1))) ) # negative log-likelihood
end

# Use Optim.jl optimiser to minimize the negative log-likelihood function w.r.t. θ
results = optimize(negative_log_likelihood, zeros(3), LBFGS())
θ = results.minimizer

# Plot the data set and ML discrimination boundary
plotDataSet()
p_1(x) = 1.0 ./ (1.0 .+ exp(-([x;1.]' * θ)))
boundary(x1) = -1 ./ θ[2] * (θ[1]*x1 .+ θ[3])
plot([-2.;10.], boundary([-2.; 10.]), "k-");
# # Also fit the generative Gaussian model from lesson 7 and plot the resulting discrimination boundary for comparison
generative_boundary = buildGenerativeDiscriminationBoundary(X, y)
plot([-2.;10.], generative_boundary([-2;10]), "k:");
legend([L"y=0";L"y=1";L"y=?";"Discr. boundary";"Gen. boundary"], loc=3);

# Given $\hat{\theta}$, we can classify a new input $x_\bullet = [3.75, 1.0]^T$:

x_test = [3.75;1.0]
println("P(C1|x•,θ) = $(p_1(x_test))")  P(C1|x•,θ) = 0.9999999132138306  • The generative model gives a bad result because the feature distribution of one class is clearly non-Gaussian: the model does not fit the data well. • The discriminative approach does not suffer from this problem because it makes no assumptions about the feature distribition$p(x|y)$, it just estimates the conditional class distribution$p(y|x)$directly. ### Recap Classification¶  Generative Discriminative (ML) 1 Like density estimation, model joint prob. $$p(\mathcal{C}_k) p(x|\mathcal{C}_k) = \pi_k \mathcal{N}(\mu_k,\Sigma)$$ Like (linear) regression, model conditional $$p(\mathcal{C}_k|x,\theta)$$ 2 Leads to softmax posterior class probability $$p(\mathcal{C}_k|x,\theta ) = e^{\theta_k^T x}/Z$$ with structured$\theta$Choose also softmax posterior class probability $$p(\mathcal{C}_k|x,\theta ) = e^{\theta_k^T x}/Z$$ but now with 'free'$\theta$3 For Gaussian$p(x|\mathcal{C}_k)$and multinomial priors, $$\hat \theta_k = \left[ {\begin{array}{c} { - \frac{1}{2} \mu_k^T \sigma^{-1} \mu_k + \log \pi_k} \\ {\sigma^{-1} \mu_k } \\ \end{array}} \right]$$ in one shot. Find$\hat\theta_k$through gradient-based adaptation $$\nabla_{\theta_k}\mathrm{L}(\theta) = \sum_n \Big( y_{nk} - \frac{e^{\theta_k^T x_n}}{\sum_{k^\prime} e^{\theta_{k^\prime}^T x_n}} \Big)\, x_n$$ ## OPTIONAL SLIDES ¶ ### Proof of Derivative of Log-likelihood for Logistic Regression¶ • The Log-likelihood is$ \mathrm{L}(\theta) = \log \prod_n \prod_k {\underbrace{p(\mathcal{C}_k|x_n,\theta)}_{p_{nk}}}^{y_{nk}} = \sum_{n,k} y_{nk} \log p_{nk}$• Use the fact that the softmax$\phi_k \equiv e^{a_k} / {\sum_j e^{a_j}}has analytical derivative: \begin{align*} \frac{\partial \phi_k}{\partial a_j} &= \frac{(\sum_j e^{a_j})e^{a_k}\delta_{kj}-e^{a_j}e^{a_k}}{(\sum_j e^{a_j})^2} = \frac{e^{a_k}}{\sum_j e^{a_j}}\delta_{kj} - \frac{e^{a_j}}{\sum_j e^{a_j}} \frac{e^{a_k}}{\sum_j e^{a_j}}\\ &= \phi_k \cdot(\delta_{kj}-\phi_j) \end{align*} • Take the derivative of\mathrm{L}(\theta)\$ (or: how to spend a hour ...) \begin{align*} \nabla_{\theta_j} \mathrm{L}(\theta) &= \sum_{n,k} \frac{\partial \mathrm{L}_{nk}}{\partial p_{nk}} \cdot\frac{\partial p_{nk}}{\partial a_{nj}}\cdot\frac{\partial a_{nj}}{\partial \theta_j} \\ &= \sum_{n,k} \frac{y_{nk}}{p_{nk}} \cdot p_{nk} (\delta_{kj}-p_{nj}) \cdot x_n \\ &= \sum_n \Big( y_{nj} (1-p_{nj}) -\sum_{k\neq j} y_{nk} p_{nj} \Big) \cdot x_n \\ &= \sum_n \left( y_{nj} - p_{nj} \right)\cdot x_n \\ &= \sum_n \Big( \underbrace{y_{nj}}_{\text{target}} - \underbrace{\frac{e^{\theta_j^T x_n}}{\sum_{j^\prime} e^{\theta_{j^\prime}^T x_n}}}_{\text{prediction}} \Big)\cdot x_n \end{align*}
In [4]:
open("../../styles/aipstyle.html") do f