Discriminative Classification

Preliminaries

Problem: difficult class-conditional data distributions

Our task will be the same as in the preceding lesson on (generative) classification. But this time, the class-conditional data distributions look very non-Gaussian, yet a linear discrimination boundary still looks easy to draw:

In [7]:
# Generate dataset {(x1,y1),...,(xN,yN)}
# x is a 2-d feature vector [x_1;x_2]
# y ∈ {false,true} is a binary class label
# p(x|y) is multi-modal (mixture of uniform and Gaussian distributions)
using PyPlot
include("scripts/lesson8_helpers.jl")
N = 200
X, y = genDataset(N) # Generate data set, collect in matrix X and vector y
X_c1 = X[:,findall(.!y)]'; X_c2 = X[:,findall(y)]' # Split X based on class label
X_test = [3.75; 1.0] # Features of 'new' data point
function plotDataSet()
    plot(X_c1[:,1], X_c1[:,2], "bx", markersize=8)
    plot(X_c2[:,1], X_c2[:,2], "r+", markersize=8, fillstyle="none")
    plot(X_test[1], X_test[2], "ko")   
    xlabel(L"x_1"); ylabel(L"x_2"); legend([L"y=0", L"y=1",L"y=?"], loc=2)
    xlim([-2;10]); ylim([-4, 8])
end
plotDataSet();

Main Idea of Discriminative Classification

  • Again, a data set is given by $D = \{(x_1,y_1),\dotsc,(x_N,y_N)\}$ with $x_n \in \mathbb{R}^D$ and $y_n \in \mathcal{C}_k$, with $k=1,\ldots,K$.
  • Sometimes, the precise assumptions of the (multinomial-Gaussian) generative model $$p(x_n,\mathcal{C}_k|\theta) = \pi_k \cdot \mathcal{N}(x_n|\mu_k,\Sigma)$$ clearly do not match the data distribution.
  • Here's an IDEA! Let's model the posterior $$p(\mathcal{C}_k|x_n)$$ directly, without any assumptions on the class densities.
  • Of course, this implies also that we build direct models for the discrimination boundaries $$\log \frac{p(\mathcal{C}_k|x_n)}{p(\mathcal{C}_j|x_n)} \overset{!}{=} 0$$

1. Model Specification

  • [Q.] What model should we use for $p(\mathcal{C}_k|x_n)$?
  • [A.] Get inspiration from the generative approach: choose the familiar softmax structure with linear discrimination boundaries for the posterior class probability $$ p(\mathcal{C}_k|x_n,\theta) = \frac{e^{\theta_k^T x_n}}{\sum_j e^{\theta_j^T x_n}} $$ but do not impose a Gaussian structure on the class features (a minimal code sketch of this model follows after this list).
  • $\Rightarrow$ There are two key differences between the discriminative and generative approach:
    1. In the discriminative approach, the parameters $\theta_k$ are not structured into $\{\mu_k,\Sigma,\pi_k\}$. This gives the discriminative approach more flexibility.
    2. ML learning for the discriminative approach proceeds by optimization of the conditional likelihood $\prod_n p(y_n|x_n,\theta)$, rather than the joint likelihood $\prod_n p(y_n,x_n|\theta)$.
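For concreteness, a minimal Julia sketch of this softmax model is given below. It is not part of the lesson's helper code; the layout of the parameter matrix `Θ` (a $D \times K$ matrix whose $k$-th column holds $\theta_k$) is an assumption of this sketch.

# Minimal sketch of the softmax posterior p(C_k|x,θ) for K classes.
# Assumption: Θ is a D×K matrix whose k-th column holds θ_k.
function softmax_posterior(Θ::Matrix, x::Vector)
    a = Θ' * x            # discriminant activations a_k = θ_k' x
    a = a .- maximum(a)   # subtract the maximum for numerical stability
    p = exp.(a)
    return p ./ sum(p)    # p[k] = exp(θ_k' x) / Σ_j exp(θ_j' x)
end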

2. ML Estimation for Discriminative Classification

  • The conditional log-likelihood for discriminative classification is

    $$ \mathrm{L}(\theta) = \log \prod_n \prod_k {p(\mathcal{C}_k|x_n,\theta)}^{y_{nk}} $$

  • Computing the gradient $\nabla_{\theta_k} \mathrm{L}(\theta)$ leads to (for a proof, see the optional derivation below)

$$ \nabla_{\theta_k} \mathrm{L}(\theta) = \sum_n \Big( \underbrace{y_{nk}}_{\text{target}} - \underbrace{\frac{e^{\theta_k^T x_n}}{ \sum_j e^{\theta_j^T x_n}}}_{\text{prediction}} \Big)\cdot x_n $$

  • Compare this to the gradient for linear regression:

$$ \nabla_\theta \mathrm{L}(\theta) = \sum_n \left(y_n - \theta^T x_n \right) x_n $$

  • In both cases

$$ \nabla_\theta \mathrm{L} = \sum_n \left( \text{target}_n - \text{prediction}_n \right) \cdot \text{input}_n $$

  • The parameter vector $\theta$ for logistic regression can be estimated through iterative gradient-based adaptation. E.g. (with iteration index $i$),

$$ \hat{\theta}^{(i+1)} = \hat{\theta}^{(i)} + \eta \cdot \left. \nabla_\theta \mathrm{L}(\theta) \right|_{\theta = \hat{\theta}^{(i)}} $$
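As an illustration, a minimal gradient-ascent sketch for the binary (two-class) case is given below. The names `X_ext` and `y01`, the fixed learning rate `η`, and the iteration count are assumptions of this sketch, not prescribed by the lesson.

# Minimal sketch: batch gradient ascent on the conditional log-likelihood
# of binary logistic regression.
# Assumptions: X_ext is a (D+1)×N matrix of extended feature vectors and
# y01 is a length-N vector of 0/1 targets.
function logistic_ml_estimate(X_ext, y01; η=0.01, n_iters=10_000)
    θ = zeros(size(X_ext, 1))
    for i = 1:n_iters
        p  = 1.0 ./ (1.0 .+ exp.(-X_ext' * θ))   # predictions p(C_1|x_n, θ)
        ∇L = X_ext * (y01 .- p)                  # Σ_n (target - prediction) ⋅ x_n
        θ += η * ∇L                              # gradient ascent step
    end
    return θ
end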

2. (OPTIONAL) Proof of Derivative of Log-likelihood for Discriminative Classification

  • The Log-likelihood is $ \mathrm{L}(\theta) = \log \prod_n \prod_k {\underbrace{p(\mathcal{C}_k|x_n,\theta)}_{p_{nk}}}^{y_{nk}} = \sum_{n,k} y_{nk} \log p_{nk}$
  • Use the fact that the softmax $\phi_k \equiv e^{a_k} / {\sum_j e^{a_j}}$ has analytical derivative:

$$ \begin{align*} \frac{\partial \phi_k}{\partial a_j} &= \frac{(\sum_j e^{a_j})e^{a_k}\delta_{kj}-e^{a_j}e^{a_k}}{(\sum_j e^{a_j})^2} = \frac{e^{a_k}}{\sum_j e^{a_j}}\delta_{kj} - \frac{e^{a_j}}{\sum_j e^{a_j}} \frac{e^{a_k}}{\sum_j e^{a_j}}\\ &= \phi_k \cdot(\delta_{kj}-\phi_j) \end{align*}$$

  • Take the derivative of $\mathrm{L}(\theta)$ (or: how to spend an hour ...; a numerical sanity check of the softmax derivative follows after this derivation) $$\begin{align*} \nabla_{\theta_j} \mathrm{L}(\theta) &= \sum_{n,k} \frac{\partial \mathrm{L}_{nk}}{\partial p_{nk}} \cdot\frac{\partial p_{nk}}{\partial a_{nj}}\cdot\frac{\partial a_{nj}}{\partial \theta_j} \\ &= \sum_{n,k} \frac{y_{nk}}{p_{nk}} \cdot p_{nk} (\delta_{kj}-p_{nj}) \cdot x_n \\ &= \sum_n \Big( y_{nj} (1-p_{nj}) -\sum_{k\neq j} y_{nk} p_{nj} \Big) \cdot x_n \\ &= \sum_n \left( y_{nj} - p_{nj} \right)\cdot x_n \\ &= \sum_n \Big( \underbrace{y_{nj}}_{\text{target}} - \underbrace{\frac{e^{\theta_j^T x_n}}{\sum_{j^\prime} e^{\theta_{j^\prime}^T x_n}}}_{\text{prediction}} \Big)\cdot x_n \end{align*}$$
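As a quick numerical sanity check (not part of the lesson), the softmax derivative $\frac{\partial \phi_k}{\partial a_j} = \phi_k(\delta_{kj}-\phi_j)$ can be compared against central finite differences; the activation values below are arbitrary and chosen only for illustration.

using LinearAlgebra  # for Diagonal

# Check ∂ϕ_k/∂a_j = ϕ_k(δ_kj - ϕ_j) against central finite differences
softmax(a) = exp.(a) ./ sum(exp.(a))
e(j, n) = [i == j ? 1.0 : 0.0 for i in 1:n]   # j-th standard basis vector

a = [0.3, -1.2, 2.0]                          # arbitrary activations
ϕ = softmax(a)
J_analytical = Diagonal(ϕ) - ϕ * ϕ'           # entry (k,j) equals ϕ_k(δ_kj - ϕ_j)
ϵ = 1e-6
J_numerical = hcat([(softmax(a .+ ϵ .* e(j, 3)) .- softmax(a .- ϵ .* e(j, 3))) ./ (2ϵ) for j in 1:3]...)
println(maximum(abs.(J_analytical - J_numerical)))  # ≈ 0 up to finite-difference error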

3. Application - Classify a new input

  • Discriminative model-based prediction for a new input $x_\bullet$ is easy, namely substitute the ML estimate in the model to get

$$ p(\mathcal{C}_k |\, x_\bullet,\hat\theta) = \frac{ \mathrm{exp}\left( \hat \theta_k^T x_\bullet \right) }{ \sum_{k^\prime} \mathrm{exp}\left(\hat \theta_{k^\prime}^T x_\bullet \right)} \propto \mathrm{exp}\left(\hat \theta_k^T x_\bullet\right) $$

  • The contours of equal probability (discriminant boundaries) are lines (hyperplanes) in feature space given by $$ \log \frac{{p(\mathcal{C}_k|x,\hat \theta )}}{{p(\mathcal{C}_j|x,\hat \theta )}} = \left( \hat{\theta}_{k} - \hat{\theta}_j\right) ^T x = 0 $$

CODE EXAMPLE

Let us perform ML estimation of $\theta$ on the data set from the introduction. To allow an offset in the discrimination boundary, we add a constant 1 to the feature vector $x$. We only have to specify the (negative) log-likelihood and the gradient w.r.t. $\theta$. Then, we use an off-the-shelf optimisation library to minimize the negative log-likelihood.

We plot the resulting maximum likelihood discrimination boundary. For comparison we also plot the ML discrimination boundary obtained from the generative Gaussian classifier from lesson 7.

In [8]:
using Optim # Optimization library

y_1 = zeros(length(y)) # class 1 indicator vector
y_1[y.==false] .= 1.0
X_ext = vcat(X, ones(1, length(y))) # Extend X with a row of ones to allow an offset in the discrimination boundary

# Implement negative log-likelihood function
function negative_log_likelihood(θ::Vector)
    # Return negative log-likelihood: -L(θ)
    p_1 = 1.0 ./ (1.0 .+ exp.(-X_ext' * θ))   # P(C1|X,θ)
    return -sum(log.( (y_1 .* p_1) .+ ((1 .- y_1) .* (1 .- p_1)) )) # negative log-likelihood
end

# Use Optim.jl optimiser to minimize the negative log-likelihood function w.r.t. θ
results = optimize(negative_log_likelihood, zeros(3), LBFGS())
θ = results.minimizer

# Plot the data set and ML discrimination boundary
plotDataSet()
p_1(x) = 1.0 / (1.0 + exp(-([x; 1.0]' * θ)))
boundary(x1) = -1/θ[2] * (θ[1]*x1 .+ θ[3])
plot([-2.;10.], boundary([-2.;10.]), "k-");

# Also fit the generative Gaussian model from lesson 7 and plot the resulting discrimination boundary for comparison
generative_boundary = buildGenerativeDiscriminationBoundary(X, y)
plot([-2.;10.], generative_boundary([-2.;10.]), "k:");
legend([L"y=0";L"y=1";L"y=?";"Discr. boundary";"Gen. boundary"], loc=3);

Given $\hat{\theta}$, we can classify a new input $x_\bullet = [3.75, 1.0]^T$:

In [9]:
x_test = [3.75;1.0]
println("P(C1|x•,θ) = $(p_1(x_test))")
P(C1|x•,θ) = 0.6476513551215346
  • The generative model gives a bad result because the feature distribution of one class is clearly non-Gaussian: the model does not fit the data well.
  • The discriminative approach does not suffer from this problem, because it makes no assumptions about the feature distribution $p(x|y)$; it directly estimates the conditional class distribution $p(y|x)$.

Recap Classification

Generative vs. Discriminative

  1. Generative: like density estimation, model the joint probability $$p(\mathcal{C}_k)\, p(x|\mathcal{C}_k) = \pi_k \mathcal{N}(x|\mu_k,\Sigma)$$ Discriminative: like (linear) regression, model the conditional $$p(\mathcal{C}_k|x,\theta)$$ directly.
  2. Generative: this leads to a softmax posterior class probability $$ p(\mathcal{C}_k|x,\theta) = e^{\theta_k^T x}/Z $$ with structured $\theta$. Discriminative: choose the same softmax posterior class probability $$ p(\mathcal{C}_k|x,\theta) = e^{\theta_k^T x}/Z $$ but now with 'free' $\theta$.
  3. Generative: for Gaussian $p(x|\mathcal{C}_k)$ and multinomial priors, $$\hat \theta_k = \begin{bmatrix} -\frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log \pi_k \\ \Sigma^{-1}\mu_k \end{bmatrix}$$ is available in one shot (see the sketch below). Discriminative: find $\hat\theta_k$ through gradient-based adaptation $$\nabla_{\theta_k}\mathrm{L}(\theta) = \sum_n \Big( y_{nk} - \frac{e^{\theta_k^T x_n}}{\sum_{k^\prime} e^{\theta_{k^\prime}^T x_n}} \Big)\, x_n$$
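For completeness, a minimal sketch of the generative one-shot mapping in row 3 above, assuming the class mean `μk`, shared covariance `Σ`, and class prior `πk` have already been estimated as in lesson 7 (the function name is hypothetical):

using LinearAlgebra

# Sketch: map generative Gaussian-class estimates to the softmax parameter θ_k,
# ordered as [offset; weights] to match the recap above.
# Assumptions: μk, Σ and πk are the ML estimates from the generative model.
function generative_theta(μk::Vector, Σ::Matrix, πk::Real)
    w  = Σ \ μk                               # Σ⁻¹ μ_k
    w0 = -0.5 * μk' * (Σ \ μk) + log(πk)      # -½ μ_k' Σ⁻¹ μ_k + log π_k
    return vcat(w0, w)                        # θ_k = [offset; Σ⁻¹ μ_k]
end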

The cell below loads the style file.

In [10]:
open("../../styles/aipstyle.html") do f
    display("text/html", readstring(f))
end