$\newcommand{\Reals}{\mathbb{R}} \newcommand{\Nats}{\mathbb{N}} \newcommand{\PDK}{\mathbf{k}} \newcommand{\IS}{\mathcal{X}} \newcommand{\FM}{\Phi} \newcommand{\Gram}{K} \newcommand{\RKHS}{\mathcal{H}} \newcommand{\prodDot}[2]{\left\langle#1,#2\right\rangle} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max}$

Kernel Methods in Machine Learning

Ingmar Schuster (FU Berlin)

Kernel methods¶

Support Vector Machines (SVMs) are a Machine learning staple for classification
Gaussian Processes (GPs) very popular for regression
recent techniques for representation learning/unsupervised learning
- Deep Gaussian Processes
- Compositional Kernel Machines
very elegant mathematics

Gaussian Processes¶

model for functions/continuous output
for new input returns predicted output and uncertainty

In [3]:

display(Image(filename="GP_uq.png", width=630)) #source: http://scikit-learn.org/0.17/modules/gaussian_process.html

Support Vector Machines¶

model for classification
map data nonlinearly to higher dimensionsal space
separate points of different classes using a plane (i.e. linearly)

In [118]:

display(Image(filename="SVM.png", width=700)) #source: https://en.wikipedia.org/wiki/Support_vector_machine

Feature engineering & two simple classification algorithms¶

Motivation for kernel methods: Feature engineering in Machine Learning¶

feature engineering: map data to features with function $\FM:\IS\to \RKHS$
- handle nonlinear relations with linear methods ($\FM$ nonlinear)
- handle non-numerical data (e.g. text)

In [119]:

display(Image(filename="monomials.jpg", width=900)) #source: Berhard Schölkopf

Working in Feature Space¶

want Feature Space $\RKHS$ (the codomain of $\FM$) to be vector space to get nice mathematical structure
definition of inner products induces norms and possibility to measure angles
can use linear algebra in $\RKHS$ to solve ML problems
- inner products
- angles
- norms
- distances
induces nonlinear operations on the Input Space (domain of $\FM$)

Two simple classification algorithms¶

given data points from mixture of two distributions with densities $p_0,p_1$:

$$x_i \sim 0.5 p_0 + 0.5 p_1$$

and label $l_i = 0$ if $x_i$  generated by $p_0$, $l_i = 1$ otherwise

In [121]:

for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
    pl.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker)
pl.show(); pl.close()

Classification using inner products in Feature Space¶

compute mean feature space embedding $\mu_c = \frac{1}{N_c} \sum_{l_i = c} \FM(x_i)$ for $c \in \{0,1\}$
assign test point to most similar class which in terms of inner product between point and mean embedding $\prodDot{\FM(x)}{\mu_c}$

$$f_d(x) = \argmax_{c\in\{0,1\}} \prodDot{\FM(x)}{\mu_c}$$

(remember in $\Reals^2$ canonically: $\prodDot{a}{b} = a_1 b_1+a_2 b_2 $)

In [122]:

for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
    pl.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker)
    pl.arrow(0, 0, *data[distr_idx==idx,:].mean(0), head_width=0.2, head_length=0.2, fc=c, ec=c)
pl.title(r"Mean embeddings for $\Phi(x)=x$")
pl.show()

Classification using density estimation¶

estimate density for each class by centering a gaussian, taking mixture as estimate $$\widehat{p}_c = \frac{1}{N_c} \sum_{l_i = c} \mathcal{N}(\cdot; x_i,\Sigma)$$ for $c \in \{0,1\}$

In [124]:

plot_with_contour(data, distr_idx,
                  lambda x: stats.multivariate_normal.pdf(x, data[0,:], np.eye(2)*0.1),
                  colormesh_cmap=None, contour_classif=False)

In [125]:

est_dens_1 = dist.mixt(2, [dist.mvnorm(x, np.eye(2)*0.1) for x in data[:4]], [1./4]*4)
plot_with_contour(data, distr_idx,
                  lambda x: exp(est_dens_1.logpdf(x)),
                  colormesh_cmap=None, contour_classif=False)

In [126]:

est_dens_1 = dist.mixt(2, [dist.mvnorm(x, np.eye(2)*0.1,10) for x in data[:samps_per_distr]], [1./samps_per_distr]*samps_per_distr)
plot_with_contour(data, distr_idx,
                  lambda x: exp(est_dens_1.logpdf(x)),
                  colormesh_cmap=None, contour_classif=False)

Classification using density estimation¶

estimate density for each class by centering a gaussian, taking mixture as estimate $$\widehat{p}_c = \frac{1}{N_c} \sum_{l_i = c} \mathcal{N}(\cdot; x_i,\Sigma)$$ for $c \in \{0,1\}$
assign test point $x$ to class $c$ that gives highest value for $\widehat{p}_c(x)$
$\widehat{p}_c$ is known as a kernel density estimate (KDE)
algorithm sometimes known as Parzen windows classification

For a certain feature map and inner product, both algorithms are the same!

Kernels and feature space¶

Positive definite functions and feature spaces¶

let $\PDK:\IS\times\IS \to \Reals$, called a kernel
- if $\PDK$ is symmetric and positive semi definite (psd) (i.e. matrix $\Gram$ defined by $\Gram_{i,j} = \PDK(x_i, x_j)$ for $x_i, x_j \in \IS$ has only nonnegative eigenvalues)
- then there exists $\FM: \IS \to \RKHS$ to a hilbert space $\RKHS$ such that $$\PDK(x_i, x_j) = \prodDot{\FM(x_i)}{\FM(x_j)}_\RKHS$$
i.e. $\PDK$ computes inner product after mapping to some $\RKHS$

Examples of psd kernels¶

Linear $\PDK_L(x,x') = \prodDot{x}{x'} = x_1 x'_1 + x_2 x'_2+ \dots$
Gaussian $\PDK_G(x,x') = \exp({-{ 0.5}(x-x' )^{\top }\Sigma ^{-1}(x-x' )})$

PSD kernels¶

easy to construct $\PDK$ given $\FM$: $\PDK(x_i, x_j) = \prodDot{\FM(x_i)}{\FM(x_j)}$
construction for $\FM$ given $\PDK$ not trivial but still elementary
can endow space with norm induced by the inner product

$$\|g\|_\PDK = \sqrt{\prodDot{g}{g}} = \left(\cos(\angle[g,g])~\|g\|_\PDK ~\|g\|_\PDK\right)^{0.5}$$

for $g \in \RKHS$

Construction of the canonical feature map (Aronszajn map)¶

Plan

construction of $\FM$ from $\PDK$
definition of inner product in new space $\RKHS$ such that in fact $\PDK(x,x') = \prodDot{\FM(x)}{\FM(x)}$
feature for each $x \in \IS$ will be a function from $\IS$ to $\Reals$

$$\FM:\IS \to \Reals^\IS$$

Canonical feature map (Aronszajn map)¶

pick $\FM(x) = \PDK(\cdot, x)$
- Linear kernel: $\FM_L(x) = \prodDot{\cdot}{x}$
- Gaussian kernel: $\FM_G(x) = \exp\left(-0.5{\|\cdot -x \|^2}/{\sigma^2}\right)$.
let linear combinations of features also be in $\RKHS$ $$f(\cdot)=\sum_{i=1}^m a_i \PDK(\cdot, x_i) \in \RKHS$$ for $a_i \in \Reals$
$\RKHS$ a vector space over $\Reals$ : if $f(\cdot)$ and $g(\cdot)$ functions from $\IS$ to $\Reals$, so are $a~f(\cdot)$ for $a \in \Reals, f(\cdot)+g(\cdot)$

Canonical inner product¶

for $f(\cdot)=\sum_{i=1}^m a_i \PDK(\cdot, x_i) \in \RKHS$ and $g(\cdot)=\sum_{j=1}^{m'} b_j \PDK(\cdot, x'_j) \in \RKHS$ define inner product

$$\prodDot{f}{g} = \sum_{i=1}^m \sum_{j=1}^{m'} b_j a_i \PDK(x'_j, x_i)$$

In particular $\prodDot{f}{g} = \prodDot{ \PDK(\cdot,x)}{ \PDK(\cdot,x')}=\PDK(x,x')$ (reproducing property of kernel in its $\RKHS$)
$\RKHS$ a hilbert space with this inner product, as it is
- positive definite
- linear in its first argument
- symmetric
- complete
$\RKHS$ called Reproducing Kernel Hilbert Space (RKHS).

Equivalence of classification algorithms¶

taking the inner product between mean RKHS embedding and test point embedding equivalent to evaluating Kernel density estimate

Lets look at example classification output

In [128]:

data, distr_idx = sklearn.datasets.make_circles(n_samples=400, factor=.3, noise=.05)

(sim, dec) = kernel_mean_inner_prod_classification(data[distr_idx==0,:], data[distr_idx==1,:],  LinearKernel())
plot_with_contour(data, distr_idx, sim, 'Inner Product classif. '+"Linear", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)

In [129]:

(sim, dec) = kernel_mean_inner_prod_classification(data[distr_idx==0,:], data[distr_idx==1,:],  GaussianKernel(0.1))
plot_with_contour(data, distr_idx, sim, 'Inner Product classif. '+"Gauss", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)

In [130]:

(sim, dec) = kernel_mean_inner_prod_classification(data[distr_idx==0,:], data[distr_idx==1,:],  StudentKernel(0.1,10))
plot_with_contour(data, distr_idx, sim, 'Inner Product classif. '+"Student-t", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)

Conclusion & Outlook¶

kernels implicitly compute inner products in well-behaved vector spaces
use angles, norms, distances in these for ML algorithms
successful models in ML build on PSD kernels (Support Vector Machines, Gaussian Processes)
many more kernel-based methods exist
- kernel embedding of probability distributions
- kernel induced covariance structure (GPs)

Get slides at ingmarschuster.wordpress.com