$\newcommand{\Reals}{\mathbb{R}} \newcommand{\Nats}{\mathbb{N}} \newcommand{\PDK}{\mathbf{k}} \newcommand{\IS}{\mathcal{X}} \newcommand{\FM}{\Phi} \newcommand{\Gram}{K} \newcommand{\RKHS}{\mathcal{H}} \newcommand{\prodDot}[2]{\left\langle#1,#2\right\rangle} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max}$

Kernel Methods in Machine Learning

Ingmar Schuster (FU Berlin)

Kernel methods

  • Support Vector Machines (SVMs) are a machine learning staple for classification
  • Gaussian Processes (GPs) very popular for regression
  • recent techniques for representation learning/unsupervised learning
    • Deep Gaussian Processes
    • Compositional Kernel Machines
  • very elegant mathematics

Gaussian Processes

  • model for functions/continuous output
  • for new input returns predicted output and uncertainty
In [3]:
display(Image(filename="GP_uq.png", width=630)) #source: http://scikit-learn.org/0.17/modules/gaussian_process.html

Support Vector Machines

  • model for classification
  • map data nonlinearly to a higher-dimensional space
  • separate points of different classes using a plane (i.e. linearly)
In [118]:
display(Image(filename="SVM.png", width=700)) #source: https://en.wikipedia.org/wiki/Support_vector_machine

Feature engineering & two simple classification algorithms

Motivation for kernel methods: Feature engineering in Machine Learning

  • feature engineering: map data to features with function $\FM:\IS\to \RKHS$
    • handle nonlinear relations with linear methods ($\FM$ nonlinear)
    • handle non-numerical data (e.g. text)
In [119]:
display(Image(filename="monomials.jpg", width=900)) #source: Berhard Schölkopf

Working in Feature Space

  • want the Feature Space $\RKHS$ (the codomain of $\FM$) to be a vector space to get nice mathematical structure
  • defining an inner product on $\RKHS$ induces a norm and the possibility to measure angles
  • can use linear algebra in $\RKHS$ to solve ML problems
    • inner products
    • angles
    • norms
    • distances
  • linear operations in $\RKHS$ induce nonlinear operations on the Input Space (domain of $\FM$); see the sketch below
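A minimal sketch of these feature-space computations with an explicitly chosen feature map; the map phi and the example points are assumptions for illustration only:
In [ ]:
import numpy as np

# Hypothetical nonlinear feature map from R^2 to R^3 (illustration only).
def phi(x):
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

x, y = np.array([1.0, 0.5]), np.array([-0.3, 2.0])
fx, fy = phi(x), phi(y)

inner = fx @ fy                                          # inner product in feature space
norm_x = np.sqrt(fx @ fx)                                # induced norm
dist = np.linalg.norm(fx - fy)                           # induced distance
angle = np.arccos(inner / (norm_x * np.sqrt(fy @ fy)))   # angle between embeddings
print(inner, norm_x, dist, angle)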

Two simple classification algorithms

  • given data points from mixture of two distributions with densities $p_0,p_1$: $$x_i \sim 0.5 p_0 + 0.5 p_1$$ and label $l_i = 0$ if $x_i$ generated by $p_0$, $l_i = 1$ otherwise
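A minimal sketch of this sampling step; the component means, covariances and samps_per_distr are illustrative assumptions, not necessarily the values used to produce the plots below:
In [ ]:
import numpy as np

# Sample from the two-component mixture 0.5*p_0 + 0.5*p_1 described above,
# keeping track of which component generated each point (the labels l_i).
samps_per_distr = 100
rng = np.random.RandomState(0)
means = [np.array([-1.5, 0.0]), np.array([1.5, 0.0])]      # locations of p_0 and p_1
data = np.vstack([rng.multivariate_normal(m, np.eye(2) * 0.5, samps_per_distr)
                  for m in means])
distr_idx = np.repeat([0, 1], samps_per_distr)             # labels l_i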
In [121]:
for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
    pl.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker)
pl.show(); pl.close()

Classification using inner products in Feature Space

  • compute mean feature space embedding $\mu_c = \frac{1}{N_c} \sum_{l_i = c} \FM(x_i)$ for $c \in \{0,1\}$
  • assign a test point to the most similar class, where similarity is the inner product between the point's embedding and the class mean embedding $\prodDot{\FM(x)}{\mu_c}$ $$f_d(x) = \argmax_{c\in\{0,1\}} \prodDot{\FM(x)}{\mu_c}$$ (remember in $\Reals^2$ canonically: $\prodDot{a}{b} = a_1 b_1+a_2 b_2 $); a sketch implementation follows the plot below
In [122]:
for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
    pl.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker)
    pl.arrow(0, 0, *data[distr_idx==idx,:].mean(0), head_width=0.2, head_length=0.2, fc=c, ec=c)
pl.title(r"Mean embeddings for $\Phi(x)=x$")
pl.show()
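A minimal sketch of the decision rule $f_d$ with the identity feature map $\FM(x)=x$, reusing data and distr_idx from the sampling sketch above; the helper name is made up:
In [ ]:
import numpy as np

def mean_embedding_classify(X_train, labels, X_test):
    # mean embeddings mu_c for c in {0, 1} with Phi(x) = x
    mus = [X_train[labels == c].mean(axis=0) for c in (0, 1)]
    # scores <Phi(x), mu_c> for each test point and each class
    scores = np.column_stack([X_test @ mu for mu in mus])
    return scores.argmax(axis=1)                  # pick the most similar class

pred = mean_embedding_classify(data, distr_idx, data)
print("training accuracy:", (pred == distr_idx).mean())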

Classification using density estimation

  • estimate the density for each class by centering a Gaussian at each data point and taking the mixture as the estimate $$\widehat{p}_c = \frac{1}{N_c} \sum_{l_i = c} \mathcal{N}(\cdot; x_i,\Sigma)$$ for $c \in \{0,1\}$
In [124]:
plot_with_contour(data, distr_idx,
                  lambda x: stats.multivariate_normal.pdf(x, data[0,:], np.eye(2)*0.1),
                  colormesh_cmap=None, contour_classif=False)
In [125]:
est_dens_1 = dist.mixt(2, [dist.mvnorm(x, np.eye(2)*0.1) for x in data[:4]], [1./4]*4)
plot_with_contour(data, distr_idx,
                  lambda x: exp(est_dens_1.logpdf(x)),
                  colormesh_cmap=None, contour_classif=False)
In [126]:
est_dens_1 = dist.mixt(2, [dist.mvnorm(x, np.eye(2)*0.1,10) for x in data[:samps_per_distr]], [1./samps_per_distr]*samps_per_distr)
plot_with_contour(data, distr_idx,
                  lambda x: exp(est_dens_1.logpdf(x)),
                  colormesh_cmap=None, contour_classif=False)

Classification using density estimation

  • estimate the density for each class by centering a Gaussian at each data point and taking the mixture as the estimate $$\widehat{p}_c = \frac{1}{N_c} \sum_{l_i = c} \mathcal{N}(\cdot; x_i,\Sigma)$$ for $c \in \{0,1\}$
  • assign a test point $x$ to the class $c$ that gives the highest value of $\widehat{p}_c(x)$
  • $\widehat{p}_c$ is known as a kernel density estimate (KDE)
  • the algorithm is sometimes known as Parzen windows classification (see the sketch below)
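A minimal sketch of Parzen windows classification under these definitions; the helper name and the bandwidth matrix $\Sigma = 0.1\,I$ are assumptions, not the notebook's own helpers:
In [ ]:
import numpy as np
from scipy import stats

def parzen_classify(X_train, labels, X_test, cov=np.eye(2) * 0.1):
    dens = []
    for c in (0, 1):
        Xc = X_train[labels == c]
        # KDE: p_hat_c(x) = (1/N_c) sum_i N(x; x_i, Sigma)
        dens.append(np.mean([stats.multivariate_normal.pdf(X_test, mean=xi, cov=cov)
                             for xi in Xc], axis=0))
    # assign each test point to the class with the highest estimated density
    return np.argmax(np.column_stack(dens), axis=1)

pred = parzen_classify(data, distr_idx, data)
print("training accuracy:", (pred == distr_idx).mean())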

For a certain feature map and inner product, both algorithms are the same!

Kernels and feature space

Positive definite functions and feature spaces

  • let $\PDK:\IS\times\IS \to \Reals$ be a function, called a kernel
    • if $\PDK$ is symmetric and positive semi-definite (psd), i.e. every matrix $\Gram$ defined by $\Gram_{i,j} = \PDK(x_i, x_j)$ for a finite set of points $x_1, \dots, x_n \in \IS$ has only nonnegative eigenvalues,
    • then there exists a map $\FM: \IS \to \RKHS$ into a Hilbert space $\RKHS$ such that $$\PDK(x_i, x_j) = \prodDot{\FM(x_i)}{\FM(x_j)}_\RKHS$$ i.e. $\PDK$ computes an inner product after mapping to some $\RKHS$

Examples of psd kernels

  • Linear $\PDK_L(x,x') = \prodDot{x}{x'} = x_1 x'_1 + x_2 x'_2+ \dots$
  • Gaussian $\PDK_G(x,x') = \exp({-{ 0.5}(x-x' )^{\top }\Sigma ^{-1}(x-x' )})$
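A hedged sketch of these two kernels (the Gaussian one in its isotropic form $\Sigma = \sigma^2 I$) together with a numerical check that Gram matrices built from them have only nonnegative eigenvalues; the points are random illustration data:
In [ ]:
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-0.5 * np.sum((x - y) ** 2) / sigma ** 2)

X = np.random.RandomState(0).randn(20, 2)
for k in (linear_kernel, gaussian_kernel):
    # Gram matrix K_ij = k(x_i, x_j)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    # all eigenvalues nonnegative (up to numerical error) => psd
    print(k.__name__, np.all(np.linalg.eigvalsh(K) >= -1e-10))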

PSD kernels

  • easy to construct $\PDK$ given $\FM$: $\PDK(x_i, x_j) = \prodDot{\FM(x_i)}{\FM(x_j)}$
  • construction for $\FM$ given $\PDK$ not trivial but still elementary
  • can endow the space with the norm induced by the inner product $$\|g\|_\PDK = \sqrt{\prodDot{g}{g}}$$ for $g \in \RKHS$; more generally $\prodDot{f}{g} = \cos(\angle[f,g])~\|f\|_\PDK~\|g\|_\PDK$ relates inner products, angles and norms (see the sketch below)
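Since $\PDK$ computes the inner product in $\RKHS$, norms and distances of embedded points can be obtained from kernel evaluations alone, without ever constructing $\FM$ explicitly; a small sketch, where the Gaussian kernel and the test points are illustrative assumptions:
In [ ]:
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-0.5 * np.sum((x - y) ** 2) / sigma ** 2)

x, y = np.array([1.0, 0.5]), np.array([-0.3, 2.0])
norm_phi_x = np.sqrt(gaussian_kernel(x, x))                    # ||Phi(x)|| = sqrt(k(x, x))
dist_sq = (gaussian_kernel(x, x) - 2 * gaussian_kernel(x, y)   # ||Phi(x) - Phi(y)||^2
           + gaussian_kernel(y, y))                            #   = k(x,x) - 2 k(x,y) + k(y,y)
print(norm_phi_x, np.sqrt(dist_sq))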

Construction of the canonical feature map (Aronszajn map)

Plan

  • construction of $\FM$ from $\PDK$
  • definition of an inner product in the new space $\RKHS$ such that in fact $\PDK(x,x') = \prodDot{\FM(x)}{\FM(x')}$
  • feature for each $x \in \IS$ will be a function from $\IS$ to $\Reals$ $$\FM:\IS \to \Reals^\IS$$

Canonical feature map (Aronszajn map)

  • pick $\FM(x) = \PDK(\cdot, x)$

    • Linear kernel: $\FM_L(x) = \prodDot{\cdot}{x}$
    • Gaussian kernel: $\FM_G(x) = \exp\left(-0.5{\|\cdot -x \|^2}/{\sigma^2}\right)$.
  • let linear combinations of features also be in $\RKHS$ $$f(\cdot)=\sum_{i=1}^m a_i \PDK(\cdot, x_i) \in \RKHS$$ for $a_i \in \Reals$ (a small evaluation sketch follows this list)

  • $\RKHS$ is a vector space over $\Reals$: if $f(\cdot)$ and $g(\cdot)$ are functions from $\IS$ to $\Reals$, then so are $a\,f(\cdot)$ for $a \in \Reals$ and $f(\cdot)+g(\cdot)$
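A small sketch of representing such an $f(\cdot)=\sum_i a_i \PDK(\cdot,x_i)$ by its coefficients and points and evaluating it; the kernel, points and coefficients are illustrative assumptions:
In [ ]:
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-0.5 * np.sum((x - y) ** 2) / sigma ** 2)

def make_rkhs_function(kernel, points, coeffs):
    # return the function f(.) = sum_i a_i k(., x_i) as a Python callable
    def f(x):
        return sum(a * kernel(x, xi) for a, xi in zip(coeffs, points))
    return f

f = make_rkhs_function(gaussian_kernel,
                       points=[np.array([0.0, 0.0]), np.array([1.0, 1.0])],
                       coeffs=[0.5, -2.0])
print(f(np.array([0.2, 0.3])))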

Canonical inner product

  • for $f(\cdot)=\sum_{i=1}^m a_i \PDK(\cdot, x_i) \in \RKHS$ and $g(\cdot)=\sum_{j=1}^{m'} b_j \PDK(\cdot, x'_j) \in \RKHS$ define inner product $$\prodDot{f}{g} = \sum_{i=1}^m \sum_{j=1}^{m'} b_j a_i \PDK(x'_j, x_i)$$
  • In particular, for $f = \PDK(\cdot,x)$ and $g = \PDK(\cdot,x')$ this gives $\prodDot{ \PDK(\cdot,x)}{ \PDK(\cdot,x')}=\PDK(x,x')$ (the reproducing property of the kernel in its $\RKHS$); a numeric sketch follows this list
  • $\RKHS$ is a Hilbert space with this inner product, as it is
    • positive definite
    • linear in its first argument
    • symmetric
    • complete (as a space, after adding all limit points)
  • $\RKHS$ is called a Reproducing Kernel Hilbert Space (RKHS).
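A numeric sketch of the canonical inner product as the double sum $\sum_{i,j} a_i b_j \PDK(x_i, x'_j)$, written as $a^\top K b$, and of the reproducing property; the points and coefficients are illustrative assumptions:
In [ ]:
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-0.5 * np.sum((x - y) ** 2) / sigma ** 2)

X_f = np.array([[0.0, 0.0], [1.0, 1.0]]); a = np.array([0.5, -2.0])   # f = sum_i a_i k(., x_i)
X_g = np.array([[0.5, 0.0]]);             b = np.array([1.5])         # g = sum_j b_j k(., x'_j)

# <f, g> = sum_i sum_j a_i b_j k(x_i, x'_j) = a^T K_cross b
K_cross = np.array([[gaussian_kernel(xi, xj) for xj in X_g] for xi in X_f])
inner_fg = a @ K_cross @ b

# Reproducing property: for g = k(., x), the definition of <f, g> collapses to
# sum_i a_i k(x_i, x), which is exactly f(x).
x = np.array([0.3, -0.2])
f_at_x = a @ np.array([gaussian_kernel(xi, x) for xi in X_f])
print(inner_fg, f_at_x)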

Equivalence of classification algorithms

  • taking the inner product between the mean RKHS embedding and the test point embedding is equivalent to evaluating the kernel density estimate (up to a constant factor, for the Gaussian kernel)
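  • concretely, with the canonical feature map $\FM(x)=\PDK(\cdot,x)$: $$\prodDot{\FM(x)}{\mu_c} = \frac{1}{N_c}\sum_{l_i = c} \prodDot{\PDK(\cdot,x)}{\PDK(\cdot,x_i)} = \frac{1}{N_c}\sum_{l_i = c} \PDK(x, x_i)$$ and for the Gaussian kernel $\PDK_G(x,x_i) \propto \mathcal{N}(x; x_i, \Sigma)$, so this is proportional to $\widehat{p}_c(x)$ and the $\argmax$ decisions coincide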

Let's look at example classification output

In [128]:
data, distr_idx = sklearn.datasets.make_circles(n_samples=400, factor=.3, noise=.05)

(sim, dec) = kernel_mean_inner_prod_classification(data[distr_idx==0,:], data[distr_idx==1,:],  LinearKernel())
plot_with_contour(data, distr_idx, sim, 'Inner Product classif. '+"Linear", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)
In [129]:
(sim, dec) = kernel_mean_inner_prod_classification(data[distr_idx==0,:], data[distr_idx==1,:],  GaussianKernel(0.1))
plot_with_contour(data, distr_idx, sim, 'Inner Product classif. '+"Gauss", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)
In [130]:
(sim, dec) = kernel_mean_inner_prod_classification(data[distr_idx==0,:], data[distr_idx==1,:],  StudentKernel(0.1,10))
plot_with_contour(data, distr_idx, sim, 'Inner Product classif. '+"Student-t", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)