- Goal
- Introduction to discriminative classification models

- Materials
- Mandatory
- These lecture notes

- Optional
- Bishop pp. 213 - 217 (Laplace approximation)
- Bishop pp. 217 - 220 (Bayesian logistic regression)
- T. Minka (2005), Discriminative models, not discriminative training

- Mandatory

Our task will be the same as in the preceding class on (generative) classification, but this time the class-conditional data distributions look decidedly non-Gaussian, while a linear discrimination boundary still looks easy enough:

In [1]:

```
using Pkg;Pkg.activate("probprog/workspace");Pkg.instantiate()
using Random; Random.seed!(1234);
IJulia.clear_output();
```

In [2]:

```
# Generate dataset {(x1,y1),...,(xN,yN)}
# x is a 2-d feature vector [x_1;x_2]
# y ∈ {false,true} is a binary class label
# p(x|y) is multi-modal (mixture of uniform and Gaussian distributions)
using PyPlot
include("./scripts/lesson8_helpers.jl")
N = 200
X, y = genDataset(N) # Generate data set, collect in matrix X and vector y
X_c1 = X[:,findall(.!y)]'; X_c2 = X[:,findall(y)]' # Split X based on class label
X_test = [3.75; 1.0] # Features of 'new' data point
function plotDataSet()
    plot(X_c1[:,1], X_c1[:,2], "bx", markersize=8)
    plot(X_c2[:,1], X_c2[:,2], "r+", markersize=8, fillstyle="none")
    plot(X_test[1], X_test[2], "ko")
    xlabel(L"x_1"); ylabel(L"x_2");
    legend([L"y=0", L"y=1", L"y=?"], loc=2)
    xlim([-2;10]); ylim([-4, 8])
end
plotDataSet();
```

- Again, a data set is given by $D = \{(x_1,y_1),\dotsc,(x_N,y_N)\}$ with $x_n \in \mathbb{R}^M$ and $y_n \in \mathcal{C}_k$, with $k=1,\ldots,K$.

- Sometimes, the precise assumptions of the (multinomial-Gaussian) generative model $$p(x_n,y_n\in\mathcal{C}_k|\theta) = \pi_k \cdot \mathcal{N}(x_n|\mu_k,\Sigma)$$ clearly do not match the data distribution.

- Here's an **IDEA**! Let's model the posterior $$p(y_n\in\mathcal{C}_k|x_n)$$ *directly*, without any assumptions on the class densities.

- Of course, this also implies that we build direct models for the **discrimination boundaries**, which are given by $$\frac{p(y_n\in\mathcal{C}_k|x_n)}{p(y_n\in\mathcal{C}_j|x_n)} \overset{!}{=} 1$$

- We will work this idea out for a 2-class problem. Assume a data set is given by $D = \{(x_1,y_1),\dotsc,(x_N,y_N)\}$ with $x_n \in \mathbb{R}^M$ and $y_n \in \{0,1\}$.

- What model should we use for the posterior distribution $p(y_n \in \mathcal{C}_k|x_n)$?

- In logistic regression, we take inspiration from the generative approach, where the **softmax** function "emerged" as the posterior. Here, we **choose** the 2-class softmax function (which is called the **logistic** function) with linear discrimination boundaries for the posterior class probability: $$ p(y_n =1 \,|\, x_n, w) = \sigma(w^T x_n) \,, $$ where $$\sigma(a) = \frac{1}{1+e^{-a}}$$ is the *logistic* function.

- (Bishop fig.4.9). The logistic function $\sigma(a) = 1/(1+e^{-a})$ (red), together with the scaled probit function $\Phi(\lambda a)$, for $\lambda^2=\pi/8$ (in blue). We will use this approximation later in the Laplace approximation.
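As a quick numerical check of this approximation (our addition, not part of the original notebook; it assumes the Distributions package is available for the Gaussian CDF):

```
using Distributions

σ(a) = 1 / (1 + exp(-a))        # logistic function
Φ(a) = cdf(Normal(0, 1), a)     # probit function (Gaussian CDF)
λ = sqrt(π / 8)                 # scaling that matches the slopes at a = 0

# Maximum absolute deviation between σ(a) and Φ(λa) over a grid of inputs
a_grid = -6:0.01:6
println(maximum(abs.(σ.(a_grid) .- Φ.(λ .* a_grid))))  # small (on the order of 0.02)
```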

- Indeed, for this choice of posterior class probabilities, the discrimination boundary is a straight line, see Exercises.

- Adding the other class ($y_n=0$) leads to the following posterior class distribution:
$$\begin{align*}
p(y_n \,|\, x_n, w) &= \mathrm{Bernoulli}\left(y_n \,|\, \sigma(w^T x_n) \right) \\
&= \sigma(w^T x_n)^{y_n} \left(1 - \sigma(w^T x_n)\right)^{(1-y_n)} \tag{B-4.89} \\
&= \sigma\left( (2y_n-1) w^T x_n\right)
\end{align*}$$
- Note that for the third equality, we have made use of the fact that $\sigma(-a) = 1-\sigma(a)$.
- All three expressions in B-4.89 are **equivalent**. We mention all three notational options since they all appear in the literature.
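As a small sanity check (our addition), the Bernoulli form and the compact form of B-4.89 can be verified to agree numerically:

```
σ(a) = 1 / (1 + exp(-a))

a = 1.3   # an arbitrary value for wᵀxₙ
for y_n in (0, 1)
    bernoulli = σ(a)^y_n * (1 - σ(a))^(1 - y_n)   # B-4.89, Bernoulli form
    compact   = σ((2y_n - 1) * a)                 # compact form
    println("y_n = $y_n: ", bernoulli ≈ compact)  # prints true
end
```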

- This choice for the class posterior is called **logistic regression**, in analogy to **linear regression** (where $p(y_n|x_n,w) = \mathcal{N}(y_n|w^T x_n,\sigma^2)$).

- Note that in this model specification, we do not impose a Gaussian structure on the class features. In the discriminative approach, the parameters $w$ are **not** structured into $\{\mu,\Sigma,\pi \}$. This provides the discriminative approach with more flexibility than the generative approach.

- In *Bayesian* logistic regression, we add a **Gaussian prior on the weights**: $$\begin{align*} p(w) = \mathcal{N}(w \,|\, m_0, S_0) \tag{B-4.140} \end{align*}$$

The posterior for the weights follows by Bayes rule $$\begin{align*} \underbrace{p(w \,|\, D)}_{\text{posterior}} &\propto p(w) p(D|w) \\ &= \underbrace{\mathcal{N}(w \,|\, m_0, S_0)}_{\text{prior}} \cdot \underbrace{\prod_{n=1}^N \sigma\left( (2y_n-1) w^T x_n\right)}_{\text{likelihood}} \tag{B-4.142} \end{align*}$$

In principle, Bayesian inference is now done. Unfortunately, the posterior is not Gaussian, and the evidence $p(D)$ is not analytically computable either. (We will deal with this later.)

- For a new data point $x_\bullet$, the predictive distribution for $y_\bullet$ is given by $$\begin{align*} p(y_\bullet = 1 \mid x_\bullet, D) &= \int p(y_\bullet = 1 \,|\, x_\bullet, w) \, p(w\,|\, D) \,\mathrm{d}w \\ &= \int \sigma(w^T x_\bullet) \, p(w\,|\, D) \,\mathrm{d}w \tag{B-4.145} \end{align*}$$

- After substitution of $p(w | D)$ from B-4.142, we have an integral that is not solvable in closed-form.

- Many methods have been developed to approximate these integrals for the predictive distribution and the evidence. Here, we present the **Laplace approximation**, which is one of the simplest methods with broad applicability to Bayesian calculations.

- The central idea of the Laplace approximation is to approximate a (possibly unnormalized) distribution $f(z)$ by a Gaussian distribution $q(z)$.

- Note that $\log q(z)$ is a second-order polynomial in $z$, so we will find the Gaussian by fitting a parabola to $\log f(z)$.

- (Bishop fig.4.14a). Laplace approximation (in red) to the distribution $p(z)\propto \exp(-z^2/2)\sigma(20z+4)$, where $\sigma(a)=1/(1+e^{-a})$. The Laplace approximation is centered on the mode of $p(z)$.

- Since the gradient $\nabla \left. \log f(z) \right|_{z=z_0}$ vanishes at the mode $z=z_0$, we can (Taylor) expand $\log f(z)$ around $z=z_0$ as $$\begin{align*} \log f(z) &\approx \log f(z_0) + \overbrace{\left(\nabla \log f(z_0)\right)^T (z-z_0)}^{=0 \text{ at the mode}} + \ldots \\ &\qquad + \frac{1}{2} (z-z_0)^T \left(\nabla \nabla \log f(z_0)\right) (z-z_0) \\ &= \log f(z_0) - \frac{1}{2} (z-z_0)^T A (z-z_0) \tag{B-4.131} \end{align*}$$ where the Hessian matrix $A$ is defined by $$ A = - \nabla \nabla \left. \log f(z) \right|_{z=z_0} \tag{B-4.132} $$

- After taking exponentials in eq. B-4.131, we obtain $$ f(z) \approx f(z_0) \exp\left( - \frac{1}{2} (z-z_0)^T A (z-z_0)\right) \tag{B-4.133} $$

- We can now identify $q(z)$ as $$ q(z) = \mathcal{N}\left( z\,|\,z_0, A^{-1}\right) \tag{B-4.134} $$ with $z_0$ and $A$ defined by eqs. B-4.126 and B-4.132.
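To make this recipe concrete, here is a minimal 1-D sketch (our addition), applied to the distribution $p(z)\propto \exp(-z^2/2)\sigma(20z+4)$ from the figure above; it finds the mode with Optim and approximates the second derivative with central finite differences:

```
using Optim

σ(a) = 1 / (1 + exp(-a))
log_f(z) = -z^2 / 2 + log(σ(20z + 4))   # log of the unnormalized p(z)

# Step 1: find the mode z₀ by minimizing -log f(z)
z₀ = optimize(z -> -log_f(z[1]), [0.0], LBFGS()).minimizer[1]

# Step 2: precision A = -(d²/dz²) log f(z) at z₀, via central differences
h = 1e-5
A = -(log_f(z₀ + h) - 2log_f(z₀) + log_f(z₀ - h)) / h^2

println("q(z) = N(z | $z₀, $(1 / A))")  # Gaussian Laplace approximation (B-4.134)
```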

- Let's get back to the challenge of computing the predictive class distribution (B-4.145) for Bayesian logistic regression. We first work out the Gaussian Laplace approximation $q(w)$ to the posterior weight distribution $$\begin{align*} \underbrace{p(w | D)}_{\text{posterior}} \propto \underbrace{\mathcal{N}(w \,|\, m_0, S_0)}_{\text{prior}} \cdot \underbrace{\prod_{n=1}^N \sigma\left( (2y_n-1) w^T x_n\right)}_{\text{likelihood}} \tag{B-4.142} \end{align*}$$

- Since we have a differentiable expression for $\log p(w | D)$, it is straightforward to compute the gradient and Hessian: $$\begin{align*} \nabla_w \log p(w | D) &= S_0^{-1}\cdot \left(m_0-w\right) + \sum_n (2y_n-1) (1-\sigma_n) x_n \\ \nabla_w \nabla_w \log p(w | D) &= -S_0^{-1} - \sum_n \sigma_n (1-\sigma_n) x_n x_n^T \tag{B-4.143} \end{align*}$$ where we used shorthand $\sigma_n$ for $\sigma\left( (2y_n-1) w^T x_n\right)$.

- We can now use the gradient $\nabla_w \log p(w | D)$ to find the **mode** $w_{N}$ of $\log p(w|D)$ (e.g., by a gradient-based optimization procedure) and then use the Hessian $\nabla_w \nabla_w \log p(w | D)$ at the mode to get the covariance of $q(w)$, leading to a **Gaussian approximate weights posterior**: $$ q(w) = \mathcal{N}\left(w\,|\, w_{N}, S_N\right) \tag{B-4.144} $$ with $$ S_N^{-1} = S_0^{-1} + \sum_n \sigma_n (1-\sigma_n) x_n x_n^T \tag{B-4.143} $$
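A minimal sketch of this procedure in Julia (our addition; the variable names, prior values, and the tiny synthetic data set are our own, chosen so the snippet runs standalone; the data set from the introduction could be substituted):

```
using Optim, LinearAlgebra

σ(a) = 1 / (1 + exp(-a))

# Tiny synthetic 2-class data set (placeholder)
Xd = [randn(2, 10) .- 1.5  randn(2, 10) .+ 1.5]  # 2×20 feature matrix
yd = vcat(zeros(10), ones(10))                   # binary labels

m₀ = zeros(2)       # prior mean
S₀ = 100.0 * I(2)   # (weak) prior covariance

# Negative log-posterior, up to a constant (cf. B-4.142)
function neg_log_posterior(w)
    log_prior = -0.5 * (w - m₀)' * inv(S₀) * (w - m₀)
    log_lik   = sum(log.(σ.((2yd .- 1) .* (Xd' * w))))
    return -(log_prior + log_lik)
end

# Mode w_N by gradient-based optimization, covariance S_N from the Hessian (B-4.143/144)
w_N = optimize(neg_log_posterior, zeros(2), LBFGS()).minimizer
σ_n = σ.((2yd .- 1) .* (Xd' * w_N))
S_N = inv(inv(S₀) + Xd * Diagonal(σ_n .* (1 .- σ_n)) * Xd')
```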

- In the analytically unsolvable expressions for the evidence and the predictive distribution (estimating the class of a new observation), we proceed by using the Laplace approximation to the weights posterior. For a new observation $x_\bullet$, the class probability is now $$\begin{align*} p(y_\bullet = 1 \mid x_\bullet, D) &= \int p(y_\bullet = 1 \,|\, x_\bullet, w) \cdot p(w\,|\, D) \,\mathrm{d}w \\ &\approx \int p(y_\bullet = 1 \,|\, x_\bullet, w) \cdot \underbrace{q(w)}_{\text{Gaussian}} \,\mathrm{d}w \\ &= \int \sigma(w^T x_\bullet) \cdot \mathcal{N}\left(w \,|\, w_N, S_N\right) \,\mathrm{d}w \tag{B-4.145} \end{align*}$$

- This looks better but we need two more clever tricks to evaluate this expression.
- First, note that $w$ appears in $\sigma(w^T x_\bullet)$ as an inner product, so through substitution of $a:=w^T x_\bullet$, the expression simplifies to an integral over the scalar $a$ (see Bishop for derivation): $$\begin{align*} p(y_\bullet = 1 \mid x_\bullet, D) &\approx \int \sigma(a) \, \mathcal{N}\left(a\,|\, \mu_a, \sigma_a^2\right) \,\mathrm{d}a \tag{B-4.151}\\ \mu_a &= w^T_{N} x_\bullet \tag{B-4.149}\\ \sigma_a^2 &= x^T_\bullet S_N x_\bullet \tag{B-4.150} \end{align*}$$
- Secondly, while the integral of the product of a logistic function with a Gaussian is not analytically solvable, the integral of the product of a Gaussian cumulative distribution function (CDF, also known as the probit function) with a Gaussian *does* have a closed-form solution. Fortunately, $$\Phi(\lambda a) \approx \sigma(a)$$ with the Gaussian CDF $\Phi(x)= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-t^2/2}\mathrm{d}t$, $\lambda^2= \pi / 8$ and $\sigma(a) = 1/(1+e^{-a})$. Thus, substituting $\Phi(\lambda a)$ with $\lambda^2= \pi / 8$ for $\sigma(a)$ leads to $$ p(y_\bullet = 1 \mid x_\bullet, D) \approx \sigma\left( \kappa(\sigma_a^2)\, \mu_a \right) \tag{B-4.152} $$ with $\kappa(\sigma_a^2) = \left(1 + \pi \sigma_a^2/8\right)^{-1/2}$.
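Continuing the sketch above (our notation; `w_N` and `S_N` are assumed to come from the previous snippet, and `predict_class1` is our own helper name), the closed-form predictive probability can be computed as:

```
# Approximate predictive class probability (B-4.152),
# given a Laplace posterior q(w) = N(w | w_N, S_N)
σ(a) = 1 / (1 + exp(-a))

function predict_class1(x, w_N, S_N)
    μ_a  = w_N' * x                    # B-4.149
    σ_a² = x' * S_N * x                # B-4.150
    κ    = 1 / sqrt(1 + π * σ_a² / 8)  # shrinks toward 0 as σ_a² grows
    return σ(κ * μ_a)                  # B-4.152
end

# E.g., predict_class1([1.0, 2.0], w_N, S_N)
```

Note how a large $\sigma_a^2$ shrinks $\kappa$ and hence pulls the prediction toward $0.5$, in line with the discussion below.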

- We now have an approximate but **closed-form expression for the predictive class distribution** of a new observation with a Bayesian logistic regression model.

- Note that, by Eq.B-4.143, the variance $S_N$ (and consequently $\sigma_a^2$) for the weight vector depends on the distribution of the training set. Large uncertainty about the weights (in areas with little training data and uninformative prior variance $S_0$) takes the posterior class probability eq. B-4.152 closer to $0.5$. Does that make sense?

- Apparently, the Laplace approximation leads to closed-form solutions for Bayesian logistic regression (although admittedly, the derivation is no walk in the park).

- In practice, rather than the computationally involved Bayesian method, discriminative classification is often carried out through maximum likelihood estimation.

With the usual 1-of-K encoding scheme for classes ($y_{nk}=1$ if $x_n \in \mathcal{C}_k$, otherwise $y_{nk}=0$), the log-likelihood for a $K$-dimensional discriminative classifier is

$$\begin{align*} \mathrm{L}(\theta) &= \log \prod_n \prod_k {p(\mathcal{C}_k|x_n,\theta)}^{y_{nk}} \\ &= \log \prod_n \prod_k \Bigg(\underbrace{\frac{e^{\theta_k^T x_n}}{ \sum_j e^{\theta_j^T x_n}}}_{\text{softmax function}}\Bigg)^{y_{nk}} \\ &= \sum_n \sum_k y_{nk} \log \big( \frac{e^{\theta_k^T x_n}}{ \sum_j e^{\theta_j^T x_n}} \big) \end{align*}$$

- Computing the gradient $\nabla_{\theta_k} \mathrm{L}(\theta)$ leads to (for proof, see optional slide below) $$\nabla_{\theta_k} \mathrm{L}(\theta) = \sum_n \Big( \underbrace{y_{nk}}_{\text{target}} - \underbrace{\frac{e^{\theta_k^T x_n}}{\sum_{k^\prime} e^{\theta_{k^\prime}^T x_n}}}_{\text{prediction}} \Big)\, x_n$$

- Compare this to the gradient for *linear* regression: $$\nabla_\theta \mathrm{L}(\theta) = \sum_n \left( y_n - \theta^T x_n \right) x_n$$

- In both cases, the gradient takes the form $\sum_n \left(\text{target}_n - \text{prediction}_n\right) \cdot x_n$: a sum of prediction errors weighted by the inputs.

- The parameter vector $\theta$ for logistic regression can be estimated through iterative gradient-based adaptation, e.g. (with iteration index $i$ and step size $\eta$; a bare-bones sketch follows below),
$$
\hat{\theta}^{(i+1)} = \hat{\theta}^{(i)} + \eta \cdot \left. \nabla_\theta \mathrm{L}(\theta) \right|_{\theta = \hat{\theta}^{(i)}}
$$
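As an illustration (our addition; the helper names, the fixed step size, and the random placeholder data are our own), a bare-bones version of this update for the $K$-class softmax model could look as follows:

```
using LinearAlgebra

# Numerically stable softmax
function softmax(a)
    e = exp.(a .- maximum(a))
    return e ./ sum(e)
end

# One gradient-ascent sweep: Θ is M×K, X is M×N, Y is K×N one-hot targets
function ascend!(Θ, X, Y, η)
    for n in 1:size(X, 2)
        p = softmax(Θ' * X[:, n])             # K class probabilities for xₙ
        Θ .+= η .* X[:, n] * (Y[:, n] .- p)'  # ∇L = Σₙ (yₙₖ - pₙₖ) xₙ
    end
    return Θ
end

# Placeholder example: M=2 features, K=3 classes, N=30 random points
M, K, N = 2, 3, 30
Xm = randn(M, N)
Ym = Matrix{Float64}(I, K, K)[:, rand(1:K, N)]  # random one-hot labels
Θ = zeros(M, K)
for i in 1:100
    ascend!(Θ, Xm, Ym, 0.1)
end
```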
- Note that, while in the Bayesian approach we get to update $\theta$ with **precision-weighted** prediction errors (which is optimal), in the maximum likelihood approach we weigh the prediction errors with **input** values (which is less precise).


Let us perform ML estimation of $w$ on the data set from the introduction. To allow an offset in the discrimination boundary, we add a constant 1 to the feature vector $x$. We only have to specify the (negative) log-likelihood; then, we use an off-the-shelf optimisation library to minimize it.

We plot the resulting maximum likelihood discrimination boundary. For comparison we also plot the ML discrimination boundary obtained from the code example in the generative Gaussian classifier lesson.

In [3]:

```
using Optim # Optimization library
y_1 = zeros(length(y)) # class 1 indicator vector
y_1[findall(y)] .= 1
X_ext = vcat(X, ones(1, length(y))) # Extend X with a row of ones to allow an offset in the discrimination boundary

# Negative log-likelihood function
function negative_log_likelihood(θ::Vector)
    p_1 = 1.0 ./ (1.0 .+ exp.(-X_ext' * θ)) # p(C1|X,θ) for all data points
    return -sum(log.((y_1 .* p_1) + ((1 .- y_1) .* (1 .- p_1)))) # -L(θ)
end

# Use Optim.jl optimiser to minimize the negative log-likelihood function w.r.t. θ
results = optimize(negative_log_likelihood, zeros(3), LBFGS())
θ = results.minimizer

# Plot the data set and the ML discrimination boundary
plotDataSet()
p_1(x) = 1.0 / (1.0 + exp(-([x; 1.0]' * θ)))
boundary(x1) = -1 / θ[2] * (θ[1] * x1 .+ θ[3])
plot([-2.; 10.], boundary([-2.; 10.]), "k-");

# Also fit the generative Gaussian model from lesson 7 and plot the resulting discrimination boundary for comparison
generative_boundary = buildGenerativeDiscriminationBoundary(X, y)
plot([-2.; 10.], generative_boundary([-2; 10]), "k:");
legend([L"y=0";L"y=1";L"y=?";"Discr. boundary";"Gen. boundary"], loc=3);
# Given $\hat{\theta}$, we can classify a new input $x_\bullet = [3.75, 1.0]^T$:
x_test = [3.75;1.0]
println("P(C1|x•,θ) = $(p_1(x_test))")
```

The generative model gives a bad result because the feature distribution of one class is clearly non-Gaussian: the model does not fit the data well.

The discriminative approach does not suffer from this problem because it makes no assumptions about the feature distribution $p(x|y)$; it estimates the conditional class distribution $p(y|x)$ directly.

| | Generative | Discriminative (ML) |
|:-:|:--|:--|
| 1 | Like density estimation, model the joint probability $p(\mathcal{C}_k)\, p(x\mid\mathcal{C}_k) = \pi_k \mathcal{N}(\mu_k,\Sigma)$ | Like (linear) regression, model the conditional $p(\mathcal{C}_k\mid x,\theta)$ |
| 2 | Leads to a softmax posterior class probability $p(\mathcal{C}_k\mid x,\theta ) = e^{\theta_k^T x}/Z$ with structured $\theta$ | Choose also a softmax posterior class probability $p(\mathcal{C}_k\mid x,\theta ) = e^{\theta_k^T x}/Z$, but now with 'free' $\theta$ |
| 3 | For Gaussian $p(x\mid\mathcal{C}_k)$ and multinomial priors, $\hat \theta_k = \begin{bmatrix} -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k \\ \Sigma^{-1} \mu_k \end{bmatrix}$ in one shot | Find $\hat\theta_k$ through gradient-based adaptation: $\nabla_{\theta_k}\mathrm{L}(\theta) = \sum_n \Big( y_{nk} - \frac{e^{\theta_k^T x_n}}{\sum_{k^\prime} e^{\theta_{k^\prime}^T x_n}} \Big)\, x_n$ |

- The log-likelihood is $ \mathrm{L}(\theta) = \log \prod_n \prod_k {\underbrace{p(\mathcal{C}_k|x_n,\theta)}_{p_{nk}}}^{y_{nk}} = \sum_{n,k} y_{nk} \log p_{nk}$

- Use the fact that the softmax $\phi_k \equiv e^{a_k} / {\sum_j e^{a_j}}$ has the analytical derivative $$ \frac{\partial \phi_k}{\partial a_j} = \phi_k \cdot (\delta_{kj} - \phi_j) $$

- Take the derivative of $\mathrm{L}(\theta)$ (or: how to spend an hour ...) $$\begin{align*} \nabla_{\theta_j} \mathrm{L}(\theta) &= \sum_{n,k} \frac{\partial \mathrm{L}_{nk}}{\partial p_{nk}} \cdot\frac{\partial p_{nk}}{\partial a_{nj}}\cdot\frac{\partial a_{nj}}{\partial \theta_j} \\ &= \sum_{n,k} \frac{y_{nk}}{p_{nk}} \cdot p_{nk} (\delta_{kj}-p_{nj}) \cdot x_n \\ &= \sum_n \Big( y_{nj} (1-p_{nj}) -\sum_{k\neq j} y_{nk} p_{nj} \Big) \cdot x_n \\ &= \sum_n \left( y_{nj} - p_{nj} \right)\cdot x_n \\ &= \sum_n \Big( \underbrace{y_{nj}}_{\text{target}} - \underbrace{\frac{e^{\theta_j^T x_n}}{\sum_{j^\prime} e^{\theta_{j^\prime}^T x_n}}}_{\text{prediction}} \Big)\cdot x_n \end{align*}$$
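A quick finite-difference check of the softmax derivative used above (our addition; the test vector is arbitrary):

```
using LinearAlgebra

function softmax(a)
    e = exp.(a .- maximum(a))
    return e ./ sum(e)
end

a = [0.5, -1.2, 2.0]
φ = softmax(a)

# Analytical Jacobian: ∂φ_k/∂a_j = φ_k (δ_kj - φ_j)
J_analytic = Diagonal(φ) - φ * φ'

# Numerical Jacobian by central finite differences
h = 1e-6
unit(j) = [i == j ? 1.0 : 0.0 for i in 1:3]
J_numeric = hcat([(softmax(a .+ h .* unit(j)) .- softmax(a .- h .* unit(j))) ./ (2h) for j in 1:3]...)

println(maximum(abs.(J_analytic .- J_numeric)))  # ≈ 0 (up to floating-point error)
```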

