$\newcommand{\Reals}{\mathbb{R}} \newcommand{\Nats}{\mathbb{N}} \newcommand{\PDK}{\mathbf{k}} \newcommand{\IS}{\mathcal{X}} \newcommand{\FM}{\Phi} \newcommand{\Gram}{K} \newcommand{\RKHS}{\mathcal{H}} \newcommand{\prodDot}[2]{\left\langle#1,#2\right\rangle} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max}$

# Kernel Methods in Machine Learning

## Kernel methods¶

• Support Vector Machines (SVMs) are a Machine learning staple for classification
• Gaussian Processes (GPs) very popular for regression
• recent techniques for representation learning/unsupervised learning
• Deep Gaussian Processes
• Compositional Kernel Machines
• very elegant mathematics

## Gaussian Processes¶

• model for functions/continuous output
• for new input returns predicted output and uncertainty
In [3]:
display(Image(filename="GP_uq.png", width=630)) #source: http://scikit-learn.org/0.17/modules/gaussian_process.html


## Support Vector Machines¶

• model for classification
• map data nonlinearly to higher dimensionsal space
• separate points of different classes using a plane (i.e. linearly)
In [118]:
display(Image(filename="SVM.png", width=700)) #source: https://en.wikipedia.org/wiki/Support_vector_machine


# Motivation for kernel methods: Feature engineering in Machine Learning¶

• feature engineering: map data to features with function $\FM:\IS\to \RKHS$
• handle nonlinear relations with linear methods ($\FM$ nonlinear)
• handle non-numerical data (e.g. text)
In [119]:
display(Image(filename="monomials.jpg", width=900)) #source: Berhard SchÃ¶lkopf


### Working in Feature Space¶

• want Feature Space $\RKHS$ (the codomain of $\FM$) to be vector space to get nice mathematical structure
• definition of inner products induces norms and possibility to measure angles
• can use linear algebra in $\RKHS$ to solve ML problems
• inner products
• angles
• norms
• distances
• induces nonlinear operations on the Input Space (domain of $\FM$)

### Two simple classification algorithms¶

• given data points from mixture of two distributions with densities $p_0,p_1$: $$x_i \sim 0.5 p_0 + 0.5 p_1$$ and label $l_i = 0$ if $x_i$ generated by $p_0$, $l_i = 1$ otherwise
In [121]:
for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
pl.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker)
pl.show(); pl.close()


### Classification using inner products in Feature Space¶

• compute mean feature space embedding $\mu_c = \frac{1}{N_c} \sum_{l_i = c} \FM(x_i)$ for $c \in \{0,1\}$
• assign test point to most similar class which in terms of inner product between point and mean embedding $\prodDot{\FM(x)}{\mu_c}$ $$f_d(x) = \argmax_{c\in\{0,1\}} \prodDot{\FM(x)}{\mu_c}$$ (remember in $\Reals^2$ canonically: $\prodDot{a}{b} = a_1 b_1+a_2 b_2$)
In [122]:
for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
pl.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker)
pl.title(r"Mean embeddings for $\Phi(x)=x$")
pl.show()


### Classification using density estimation¶

• estimate density for each class by centering a gaussian, taking mixture as estimate $$\widehat{p}_c = \frac{1}{N_c} \sum_{l_i = c} \mathcal{N}(\cdot; x_i,\Sigma)$$ for $c \in \{0,1\}$
In [124]:
plot_with_contour(data, distr_idx,
lambda x: stats.multivariate_normal.pdf(x, data[0,:], np.eye(2)*0.1),
colormesh_cmap=None, contour_classif=False)

In [125]:
est_dens_1 = dist.mixt(2, [dist.mvnorm(x, np.eye(2)*0.1) for x in data[:4]], [1./4]*4)
plot_with_contour(data, distr_idx,
lambda x: exp(est_dens_1.logpdf(x)),
colormesh_cmap=None, contour_classif=False)

In [126]:
est_dens_1 = dist.mixt(2, [dist.mvnorm(x, np.eye(2)*0.1,10) for x in data[:samps_per_distr]], [1./samps_per_distr]*samps_per_distr)
plot_with_contour(data, distr_idx,
lambda x: exp(est_dens_1.logpdf(x)),
colormesh_cmap=None, contour_classif=False)


### Classification using density estimation¶

• estimate density for each class by centering a gaussian, taking mixture as estimate $$\widehat{p}_c = \frac{1}{N_c} \sum_{l_i = c} \mathcal{N}(\cdot; x_i,\Sigma)$$ for $c \in \{0,1\}$
• assign test point $x$ to class $c$ that gives highest value for $\widehat{p}_c(x)$
• $\widehat{p}_c$ is known as a kernel density estimate (KDE)
• algorithm sometimes known as Parzen windows classification

# Kernels and feature space¶

## Positive definite functions and feature spaces¶

• let $\PDK:\IS\times\IS \to \Reals$, called a kernel
• if $\PDK$ is symmetric and positive semi definite (psd) (i.e. matrix $\Gram$ defined by $\Gram_{i,j} = \PDK(x_i, x_j)$ for $x_i, x_j \in \IS$ has only nonnegative eigenvalues)
• then there exists $\FM: \IS \to \RKHS$ to a hilbert space $\RKHS$ such that $$\PDK(x_i, x_j) = \prodDot{\FM(x_i)}{\FM(x_j)}_\RKHS$$ i.e. $\PDK$ computes inner product after mapping to some $\RKHS$

## Examples of psd kernels¶

• Linear $\PDK_L(x,x') = \prodDot{x}{x'} = x_1 x'_1 + x_2 x'_2+ \dots$
• Gaussian $\PDK_G(x,x') = \exp({-{ 0.5}(x-x' )^{\top }\Sigma ^{-1}(x-x' )})$

## PSD kernels¶

• easy to construct $\PDK$ given $\FM$: $\PDK(x_i, x_j) = \prodDot{\FM(x_i)}{\FM(x_j)}$
• construction for $\FM$ given $\PDK$ not trivial but still elementary
• can endow space with norm induced by the inner product $$\|g\|_\PDK = \sqrt{\prodDot{g}{g}} = \left(\cos(\angle[g,g])~\|g\|_\PDK ~\|g\|_\PDK\right)^{0.5}$$ for $g \in \RKHS$

### Construction of the canonical feature map (Aronszajn map)¶

Plan

• construction of $\FM$ from $\PDK$
• definition of inner product in new space $\RKHS$ such that in fact $\PDK(x,x') = \prodDot{\FM(x)}{\FM(x)}$
• feature for each $x \in \IS$ will be a function from $\IS$ to $\Reals$ $$\FM:\IS \to \Reals^\IS$$

### Canonical feature map (Aronszajn map)¶

• pick $\FM(x) = \PDK(\cdot, x)$

• Linear kernel: $\FM_L(x) = \prodDot{\cdot}{x}$
• Gaussian kernel: $\FM_G(x) = \exp\left(-0.5{\|\cdot -x \|^2}/{\sigma^2}\right)$.
• let linear combinations of features also be in $\RKHS$ $$f(\cdot)=\sum_{i=1}^m a_i \PDK(\cdot, x_i) \in \RKHS$$ for $a_i \in \Reals$

• $\RKHS$ a vector space over $\Reals$ : if $f(\cdot)$ and $g(\cdot)$ functions from $\IS$ to $\Reals$, so are $a~f(\cdot)$ for $a \in \Reals, f(\cdot)+g(\cdot)$

### Canonical inner product¶

• for $f(\cdot)=\sum_{i=1}^m a_i \PDK(\cdot, x_i) \in \RKHS$ and $g(\cdot)=\sum_{j=1}^{m'} b_j \PDK(\cdot, x'_j) \in \RKHS$ define inner product $$\prodDot{f}{g} = \sum_{i=1}^m \sum_{j=1}^{m'} b_j a_i \PDK(x'_j, x_i)$$
• In particular $\prodDot{f}{g} = \prodDot{ \PDK(\cdot,x)}{ \PDK(\cdot,x')}=\PDK(x,x')$ (reproducing property of kernel in its $\RKHS$)
• $\RKHS$ a hilbert space with this inner product, as it is
• positive definite
• linear in its first argument
• symmetric
• complete
• $\RKHS$ called Reproducing Kernel Hilbert Space (RKHS).

### Equivalence of classification algorithms¶

• taking the inner product between mean RKHS embedding and test point embedding equivalent to evaluating Kernel density estimate

Lets look at example classification output

In [128]:
data, distr_idx = sklearn.datasets.make_circles(n_samples=400, factor=.3, noise=.05)

(sim, dec) = kernel_mean_inner_prod_classification(data[distr_idx==0,:], data[distr_idx==1,:],  LinearKernel())
plot_with_contour(data, distr_idx, sim, 'Inner Product classif. '+"Linear", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)

In [129]:
(sim, dec) = kernel_mean_inner_prod_classification(data[distr_idx==0,:], data[distr_idx==1,:],  GaussianKernel(0.1))
plot_with_contour(data, distr_idx, sim, 'Inner Product classif. '+"Gauss", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)

In [130]:
(sim, dec) = kernel_mean_inner_prod_classification(data[distr_idx==0,:], data[distr_idx==1,:],  StudentKernel(0.1,10))
plot_with_contour(data, distr_idx, sim, 'Inner Product classif. '+"Student-t", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)