$\newcommand{\Reals}{\mathbb{R}} \newcommand{\Nats}{\mathbb{N}} \newcommand{\PDK}{{k}} \newcommand{\IS}{\mathcal{X}} \newcommand{\FM}{\Phi} \newcommand{\Gram}{G} \newcommand{\RKHS}{\mathcal{H}} \newcommand{\prodDot}[2]{\left\langle#1,#2\right\rangle} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max}$

Tutorial on modern kernel methods




Ingmar Schuster

(Zalando Research)

Overview

  • Introduction
  • Feature engineering and two simple classification algorithms
  • Kernels and feature space
  • Applications
    • mean embedding, conditional mean embedding and operators
    • Training GANs
    • Regression with uncertainty estimates
  • Conclusion and outlook

Introduction

Kernel methods

  • Support Vector Machines (SVMs) are a staple for classification
  • Gaussian Processes (GPs) very popular for regression
  • kernel mean embedding of distributions/conditional distributions
  • recent techniques for representation learning/unsupervised learning
    • Deep Gaussian Processes
    • Compositional Kernel Machines
  • very elegant mathematics

Gaussian Processes

  • model for functions/continuous output
  • for new input returns predicted output and uncertainty
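A minimal sketch of GP regression with an uncertainty estimate, using scikit-learn's GaussianProcessRegressor on assumed toy data (an illustration, not the code behind the figure below):

In [ ]:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 10, 8)[:, None]      # training inputs
y = np.sin(X).ravel()                   # training outputs
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X, y)

X_new = np.linspace(0, 10, 100)[:, None]
mean, std = gp.predict(X_new, return_std=True)  # predicted output and uncertainty for new inputs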
In [15]:
display(Image(filename="GP_uq.png", width=630)) #source: http://scikit-learn.org/0.17/modules/gaussian_process.html

Support Vector Machines

  • model for classification
  • map data nonlinearly to a higher-dimensional space
  • separate points of different classes using a plane (i.e. linearly)
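A minimal sketch with scikit-learn's SVC on assumed toy data that is not linearly separable in the input space (an illustration, not the code behind the figure below):

In [ ]:
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0]**2 + X[:, 1]**2 > 1).astype(int)  # classes separated by a circle, not a line

# the RBF kernel maps implicitly to a feature space where a separating plane exists
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))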
In [16]:
display(Image(filename="SVM.png", width=700)) #source: https://en.wikipedia.org/wiki/Support_vector_machine

Feature engineering and two classification algorithms

Feature engineering in Machine Learning

  • feature engineering: map data to features with function $\FM:\IS\to \RKHS$
    • handle nonlinear relations with linear methods ($\FM$ nonlinear)
    • handle non-numerical data (e.g. text)
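A sketch of a classic hand-engineered feature map (second-order monomials in 2D, an assumed toy example): inner products of the mapped points can be computed directly from inner products in the input space:

In [ ]:
import numpy as np

def monomial_features(x):
    # Phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2): quadratic relations become linear ones
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
# inner product in feature space equals the squared inner product in input space
print(monomial_features(x) @ monomial_features(y), (x @ y)**2)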
In [17]:
display(Image(filename="monomials_small.jpg", width=800)) #source: Berhard Schölkopf

Working in Feature Space

  • want the Feature Space $\RKHS$ (the codomain of $\FM$) to be a vector space to get nice mathematical structure
  • defining an inner product induces a norm and the possibility to measure angles
  • can use linear algebra in $\RKHS$ to solve ML problems
    • inner products
    • angles
    • norms
    • distances
  • induces nonlinear operations on the Input Space (domain of $\FM$)
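A sketch of these quantities in feature space, reusing the monomial map from the sketch above (an assumed toy example):

In [ ]:
import numpy as np

def monomial_features(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

fx, fy = monomial_features(np.array([1.0, 2.0])), monomial_features(np.array([0.5, -1.0]))
inner = fx @ fy                                    # inner product in feature space
norm_fx = np.linalg.norm(fx)                       # norm induced by the inner product
dist = np.linalg.norm(fx - fy)                     # distance in feature space
cos_angle = inner / (np.linalg.norm(fx) * np.linalg.norm(fy))  # angle between features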

Two simple classification algorithms

  • given data points from a mixture of two distributions with densities $p_0,p_1$: $$x_i \sim 0.5 p_0 + 0.5 p_1$$ and labels $l_i = 0$ if $x_i$ was generated by $p_0$, $l_i = 1$ otherwise
In [19]:
f = pl.figure(**figkw)
# scatter plot of the two classes with different colors and markers
for (idx, c, marker) in [(0, 'r', (0, 3, 0)), (1, "b", "x")]:
    pl.scatter(*data[distr_idx==idx, :].T, c=c, marker=marker, alpha=0.4)
pl.show()

Classification using inner products in Feature Space

  • compute mean feature space embedding $$\mu_{0} = \frac{1}{N_0} \sum_{l_i = 0} \FM(x_i) ~~~~~~~~ \mu_{1} = \frac{1}{N_1} \sum_{l_i = 1} \FM(x_i)$$
  • assign test point to most similar class in terms of inner product between point and mean embedding $\prodDot{\FM(x)}{\mu_c}$ $$f_d(x) = \argmax_{c\in\{0,~1\}} \prodDot{\FM(x)}{\mu_c}$$ (remember in $\Reals^2$ canonically: $\prodDot{a}{b} = a_1 b_1+a_2 b_2 $)
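A minimal sketch of this classifier for the identity feature map $\FM(x)=x$ (the case plotted below); `data` and `distr_idx` are assumed to be the arrays from the scatter plot above:

In [ ]:
import numpy as np

def mean_embedding_classify(x_test, data, labels):
    mu_0 = data[labels == 0].mean(0)     # mean feature of class 0
    mu_1 = data[labels == 1].mean(0)     # mean feature of class 1
    # pick the class whose mean embedding has the larger inner product with Phi(x) = x
    return int(x_test @ mu_1 > x_test @ mu_0)

# e.g. mean_embedding_classify(np.ones(2), data, distr_idx)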
In [20]:
pl.figure(**figkw)
for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
    pl.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker, alpha=0.2)
    pl.arrow(0, 0, *data[distr_idx==idx,:].mean(0), head_width=0.3, width=0.05, head_length=0.3, fc=c, ec=c)
pl.title(r"Mean embeddings for $\Phi(x)=x$");
In [21]:
pl.figure(**figkw)
for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
    pl.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker, alpha=0.2)
    pl.arrow(0, 0, *data[distr_idx==idx,:].mean(0), head_width=0.3, width=0.05, head_length=0.3, fc=c, ec=c)
pl.title(r"Mean embeddings for $\Phi(x)=x$");
pl.scatter(np.ones(1), np.ones(1), c='k', marker='D', alpha=0.8);

Classification using density estimation

  • estimate the density for each class by centering a Gaussian at each point and taking the mixture as the estimate $$\widehat{p}_0 = \frac{1}{N_0} \sum_{l_i = 0} \mathcal{N}(\cdot; x_i,\Sigma) ~~~~~~~~ \widehat{p}_1 = \frac{1}{N_1} \sum_{l_i = 1} \mathcal{N}(\cdot; x_i,\Sigma)$$
In [23]:
est_dens_1 = dist.mixt(2, [dist.mvnorm(x, np.eye(2)*0.1) for x in data[:4]], [1./4]*4)
plot_with_contour(data, distr_idx,
                  lambda x: exp(est_dens_1.logpdf(x)),
                  colormesh_cmap=None, contour_classif=False)
In [24]:
est_dens_1 = dist.mixt(2, [dist.mvnorm(x, np.eye(2)*0.1,10) for x in data[:samps_per_distr]], [1./samps_per_distr]*samps_per_distr)
plot_with_contour(data, distr_idx,
                  lambda x: exp(est_dens_1.logpdf(x)),
                  colormesh_cmap=None, contour_classif=False)

Classification using density estimation

  • estimate the density for each class by centering a Gaussian at each point and taking the mixture as the estimate $$\widehat{p}_0 = \frac{1}{N_0} \sum_{l_i = 0} \mathcal{N}(\cdot; x_i,\Sigma) ~~~~~~~~ \widehat{p}_1 = \frac{1}{N_1} \sum_{l_i = 1} \mathcal{N}(\cdot; x_i,\Sigma)$$
  • assign test point $x$ to class $c$ that gives highest value for $\widehat{p}_c(x)$
  • $\widehat{p}_c$ is known as a kernel density estimate (KDE)
  • different but overlapping notion of 'kernel'
  • classification algorithm known as Parzen windows classification
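A minimal sketch of Parzen windows classification with an isotropic covariance $\Sigma = \sigma^2 I$ (an assumed simplification):

In [ ]:
import numpy as np

def parzen_score(x_test, class_points, sigma=0.5):
    # kernel density estimate (1/N_c) sum_i N(x_test; x_i, sigma^2 I)
    d = class_points.shape[1]
    sq_dists = np.sum((class_points - x_test)**2, axis=1)
    return np.mean(np.exp(-0.5 * sq_dists / sigma**2) / (2 * np.pi * sigma**2)**(d / 2))

def parzen_classify(x_test, points_0, points_1, sigma=0.5):
    # assign the class with the higher estimated density at the test point
    return int(parzen_score(x_test, points_1, sigma) > parzen_score(x_test, points_0, sigma))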

For a certain feature map and inner product, both algorithms are the same!

Let's construct this feature map and inner product.

Kernels and feature space

Positive definite functions and feature spaces

  • let $\PDK:\IS\times\IS \to \Reals$, called a kernel
    • if $\PDK$ is symmetric and positive semidefinite (psd)
    • then there exists $\FM: \IS \to \RKHS$ into a Hilbert space $\RKHS$ such that $$\PDK(x_i, x_j) = \prodDot{\FM(x_i)}{\FM(x_j)}_\RKHS$$ i.e. $\PDK$ computes the inner product after mapping to some $\RKHS$

Gram matrix (1)

  • If all matrices $$\Gram_{X}=\begin{bmatrix} \PDK(x_1, x_1) & \dots & \PDK(x_1, x_N)\\ \vdots & \ddots & \vdots\\ \PDK(x_N, x_1) & \dots & \PDK(x_N, x_N) \end{bmatrix}$$ are symmetric positive semidefinite, then $\PDK$ is a psd kernel
  • $\Gram_X$ is called a Gram matrix

Gram matrix (2)

  • sometimes mixed Gram matrices are needed $$\Gram_{XY} = \begin{bmatrix} \PDK(x_1, y_1) & \dots & \PDK(x_1, y_M)\\ \vdots & \ddots & \vdots\\ \PDK(x_N, y_1) & \dots & \PDK(x_N, y_M) \end{bmatrix}$$

Examples of psd kernels

  • Linear $\PDK_L(x,x') = \prodDot{x}{x'} = x_1 x'_1 + x_2 x'_2+ \dots$
  • Gaussian $\PDK_G(x,x') = \exp({-{ 0.5}(x-x' )^{\top }\Sigma ^{-1}(x-x' )})$
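A sketch computing both Gram matrices on an assumed random sample and checking numerically that their eigenvalues are (up to floating point error) nonnegative:

In [ ]:
import numpy as np

X = np.random.randn(20, 2)
G_lin = X @ X.T                                       # Gram matrix of the linear kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
G_gauss = np.exp(-0.5 * sq_dists)                     # Gaussian kernel with Sigma = I
# smallest eigenvalues should be >= 0 up to numerical error
print(np.linalg.eigvalsh(G_lin).min(), np.linalg.eigvalsh(G_gauss).min())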

PSD kernels

  • easy to construct $\PDK$ given $\FM$: $\PDK(x_i, x_j) = \prodDot{\FM(x_i)}{\FM(x_j)}$
  • construction of $\FM$ given $\PDK$ is not trivial but still elementary
  • given $\FM$ and inner product in feature space we
    • can endow space with norm induced by the inner product $$\|g\|_\RKHS = \sqrt{\prodDot{g}{g}_\RKHS}~~~\textrm{for}~g \in \RKHS$$
    • can measure angles in the new space $$\prodDot{g}{f}_\RKHS = \cos(\angle[g,f])~\|g\|_\RKHS ~\|f\|_\RKHS$$

Construction of the canonical feature map (Aronszajn map)

Plan

  • construction of $\FM$ from $\PDK$
  • definition of inner product in new space $\RKHS$ such that in fact $\PDK(x,x') = \prodDot{\FM(x)}{\FM(x')}$
  • feature for each $x \in \IS$ will be a function from $\IS$ to $\Reals$ $$\FM:\IS \to \Reals^\IS$$

Canonical feature map (Aronszajn map)

  • pick $\FM(x) = \PDK(\cdot, x)$

    • Linear kernel: $\FM_L(x) = \prodDot{\cdot}{x}$
    • Gaussian kernel: $\FM_G(x) = \exp\left(-0.5{\|\cdot -x \|^2}/{\sigma^2}\right)$.
  • let linear combinations of features also be in $\RKHS$ $$f(\cdot)=\sum_{i=1}^m a_i \PDK(\cdot, x_i) \in \RKHS$$ for $a_i \in \Reals$

  • $\RKHS$ a vector space over $\Reals$: if $f(\cdot)$ and $g(\cdot)$ are functions from $\IS$ to $\Reals$, then so are $a\,f(\cdot)$ for $a \in \Reals$ and $f(\cdot)+g(\cdot)$

Canonical inner product (1)

  • for $f(\cdot)=\sum_{i=1}^m a_i \PDK(\cdot, x_i) \in \RKHS$ and $g(\cdot)=\sum_{j=1}^{m'} b_j \PDK(\cdot, x'_j) \in \RKHS$ define inner product $$\prodDot{f}{g} = \sum_{i=1}^m \sum_{j=1}^{m'} b_j a_i \PDK(x'_j, x_i)$$
  • In particular $\prodDot{ \PDK(\cdot,x)}{ \PDK(\cdot,x')}=\PDK(x,x')$ (reproducing property of kernel in its $\RKHS$)
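A numerical sketch of this inner product for the Gaussian kernel (an assumed example): with $f = \sum_i a_i \PDK(\cdot, x_i)$ and $g = \sum_j b_j \PDK(\cdot, x'_j)$, the inner product is $a^\top \Gram_{XX'} b$ and the induced norm is $\|f\|_\RKHS = \sqrt{a^\top \Gram_X a}$:

In [ ]:
import numpy as np

def gauss_gram(X, Y, sigma=1.0):
    # mixed Gram matrix G_XY for the Gaussian kernel
    sq = np.sum((X[:, None, :] - Y[None, :, :])**2, axis=-1)
    return np.exp(-0.5 * sq / sigma**2)

X, a = np.random.randn(5, 2), np.random.randn(5)   # expansion points and coefficients of f
Y, b = np.random.randn(3, 2), np.random.randn(3)   # expansion points and coefficients of g

inner_fg = a @ gauss_gram(X, Y) @ b                # <f, g>_H
norm_f = np.sqrt(a @ gauss_gram(X, X) @ a)         # ||f||_H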

Canonical inner product (2)

  • $\RKHS$ with this inner product is a Hilbert space, since the inner product is
    • positive definite
    • linear in its first argument
    • symmetric
    • and $\RKHS$ is complete
  • $\RKHS$ is called a Reproducing Kernel Hilbert Space (RKHS).

Equivalence of classification algorithms

  • Recall mean canonical feature and kernel density estimate $$\widehat{p}_0 = \frac{1}{N_0} \sum_{l_i = 0} \mathcal{N}(\cdot; x_i,\Sigma) ~~~~~~~~ \mu_{0} = \frac{1}{N_0} \sum_{l_i = 0} \PDK(\cdot, x_i)$$
  • observe $$\frac{1}{N_0} \sum_{l_i = 0} \mathcal{N}(x^*; x_i,\Sigma) = \prodDot{\frac{1}{N_0} \sum_{l_i = 0} \PDK(\cdot, x_i)}{\PDK(\cdot, x^*)}$$ if $\PDK(x, x') = \mathcal{N}(x; x', \Sigma)$ is the Gaussian density with covariance $\Sigma$
  • kernel mean and Parzen windows classification are equivalent!
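A numerical sketch of this identity on assumed random data: with the Gaussian density as kernel, the inner product $\prodDot{\mu_0}{\PDK(\cdot, x^*)}$ is exactly the kernel density estimate of class 0 at $x^*$, so both classifiers take the argmax of the same score:

In [ ]:
import numpy as np

sigma, d = 0.5, 2
X0 = np.random.randn(10, d)                 # points with label 0
x_star = np.array([0.3, -0.2])

def k_dens(x, y):
    # Gaussian density N(x; y, sigma^2 I), used as a psd kernel in (x, y)
    return np.exp(-0.5 * np.sum((x - y)**2) / sigma**2) / (2 * np.pi * sigma**2)**(d / 2)

kde_value = np.mean([k_dens(x_star, xi) for xi in X0])       # Parzen estimate p_hat_0(x*)
mean_emb_score = np.mean([k_dens(xi, x_star) for xi in X0])  # <mu_0, k(., x*)> via reproducing property
print(np.allclose(kde_value, mean_emb_score))                # True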

Let's look at example classification output

In [26]:
data, distr_idx = sklearn.datasets.make_circles(n_samples=400, factor=.3, noise=.05)

kc = KMEclassification(data[distr_idx==0,:], data[distr_idx==1,:],  ro.LinearKernel())
plot_with_contour(data, distr_idx, kc.classification_score, 'Inner Product classif. '+"Linear", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)
In [27]:
kc = KMEclassification(data[distr_idx==0,:], data[distr_idx==1,:],  ro.GaussianKernel(0.3))
plot_with_contour(data, distr_idx, kc.classification_score, 'Inner Product classif. '+"Gaussian", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)

Applications

Kernel mean embedding

  • mean feature with canonical feature map $\frac{1}{N} \sum_{i = 1}^N \FM(x_i) = \frac{1}{N} \sum_{i = 1}^N \PDK(x_i, \cdot)$
  • this is the estimate of the kernel mean embedding of the distribution/density $\rho$ of the $x_i$ $$\mu_\rho(\cdot) = \int \PDK(x,\cdot) \mathrm{d}\rho(x)$$
  • using this we can define a distance between distributions $\rho, q$ as $$\mathrm{MMD}(\rho, q)^2 = \|\mu_\rho - \mu_q \|^2_\RKHS$$ called the maximum mean discrepancy (MMD)
  • has been used as the critic in generative adversarial networks (i.e. generative network as usual, MMD as a drop-in for the discriminator)
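A minimal sketch of the (biased) empirical MMD estimate with a Gaussian kernel, plugging the empirical mean embeddings into $\|\mu_\rho - \mu_q\|^2_\RKHS$ (assumed toy samples):

In [ ]:
import numpy as np

def gauss_gram(X, Y, sigma=1.0):
    sq = np.sum((X[:, None, :] - Y[None, :, :])**2, axis=-1)
    return np.exp(-0.5 * sq / sigma**2)

def mmd2(X, Y, sigma=1.0):
    # ||mu_X - mu_Y||_H^2 = mean k(X,X) - 2 mean k(X,Y) + mean k(Y,Y)
    return (gauss_gram(X, X, sigma).mean()
            - 2 * gauss_gram(X, Y, sigma).mean()
            + gauss_gram(Y, Y, sigma).mean())

X = np.random.randn(200, 2)           # sample from rho
Y = np.random.randn(200, 2) + 1.0     # sample from a shifted q
print(mmd2(X, X[::-1]), mmd2(X, Y))   # ~0 for identical samples, clearly positive for different ones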

Conditional mean embedding (1)

  • we can define operators on RKHSs
  • these are maps from RKHS elements to RKHS elements (i.e. maps between functions)
  • one such operator is the conditional mean embedding $\mathcal{D}_{Y|X}$
    • given the embedding of the input variable's distribution, it returns the embedding of the output variable's distribution
  • this enables regression with uncertainty estimates

Conditional mean embedding (2)

An example

In [29]:
cme = ro.ConditionMeanEmbedding(inp_samps, out_samps, ro.GaussianKernel(0.3), ro.GaussianKernel(0.3), 5)
plot_mean_embedding(cme, inp_samps, out_samps,  0.3, 2.,)

Conditional mean embedding (3)

  • closed form estimate given samples from input and output $$\begin{bmatrix}\PDK_Y(y_1, \cdot) & \dots & \PDK_Y(y_N, \cdot)\end{bmatrix} \Gram_X^{-1} \begin{bmatrix}\PDK_X(x_1, \cdot)\\ \vdots \\ \PDK_X(x_N, \cdot)\end{bmatrix}$$

  • closed form estimate of output embedding for new input $x^*$

$$\prodDot{\mathcal{D}_{Y|X}}{\PDK(x^*,\cdot)} = \begin{bmatrix}\PDK_Y(y_1, \cdot) & \dots & \PDK_Y(y_N, \cdot)\end{bmatrix} \Gram_X^{-1} \begin{bmatrix}\PDK_X(x_1, x^*)\\ \vdots \\ \PDK_X(x_N, x^*)\end{bmatrix}$$
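A minimal numerical sketch of this estimate on assumed toy data: the predicted embedding of $p(Y \mid X = x^*)$ is $\sum_i w_i \PDK_Y(y_i, \cdot)$ with weights $w = \Gram_X^{-1} \begin{bmatrix}\PDK_X(x_1, x^*) & \dots & \PDK_X(x_N, x^*)\end{bmatrix}^\top$ (a small ridge term is added here purely for numerical stability):

In [ ]:
import numpy as np

def gauss_gram(X, Y, sigma=1.0):
    sq = np.sum((X[:, None, :] - Y[None, :, :])**2, axis=-1)
    return np.exp(-0.5 * sq / sigma**2)

X = np.random.rand(100, 1) * 6                     # training inputs
Y = np.sin(X) + 0.1 * np.random.randn(100, 1)      # training outputs

x_star = np.array([[1.5]])
G_X = gauss_gram(X, X, sigma=0.3)
k_star = gauss_gram(X, x_star, sigma=0.3)
w = np.linalg.solve(G_X + 1e-3 * np.eye(len(X)), k_star)   # weights on k_Y(y_i, .)

# evaluate the embedding of p(Y | X = x*) on a grid of candidate outputs
y_grid = np.linspace(-2, 2, 50)[:, None]
embedding_vals = gauss_gram(y_grid, Y, sigma=0.3) @ w       # sum_i w_i k_Y(y_i, y)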

Conditional mean embedding (4)

  • Similar to Gaussian processes, but output distribution more flexible
    • mixture of Gaussians, Laplace, distributions on discrete objects
    • multimodality can be represented
    • multidimensional output
    • output can be combination of e.g. string and reals
  • the conditional mean embedding was used to construct the Kernel Bayes' Rule, enabling closed-form Bayesian inference
  • other types of operators have been derived (see Stefan Klus' talk next week)
In [30]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/MpzaCCbX-z4?rel=0&amp;showinfo=0&amp;start=148" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')
In [31]:
display(Image(filename="Pendulum_eigenfunctions.png", width=700))
In [32]:
display(Image(filename="KeywordClustering.png", width=700))

Conclusion and Outlook

Conclusion and Outlook

  • kernels implicitly compute inner products in well-behaved vector spaces
  • use angles, norms, distances in these for ML algorithms
  • successful models in ML build on PSD kernels
    • kernel induced covariance structure (Gaussian Processes)
    • kernel embedding of probability distributions
      • measuring distance between distributions
    • kernel embedding of conditional probability distributions

Get slides at ingmarschuster.com