The idea behind support vector machines (SVMs) is to construct an optimal separating hyperplane between two classes that may not be separable by a linear boundary in the original input space.
Advantages | Disadvantages |
---|---|
Effective in high-dimensional spaces | If #features >> #samples, overfitting must be avoided by a careful choice of kernel function and regularisation term |
Still effective when #dimensions > #samples | No probability estimates provided directly; obtaining them requires expensive cross-validation |
Memory efficient (uses only a subset of the training points, the support vectors) | - |
Versatile (different kernel functions can be specified) | - |
Considering that we want to separate classes using hyperplanes, the classification is less prone to generalisation errors if the distance between the hyperplane and the nearest data point of either class is large. The aim is therefore to maximise this margin.
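The margin-maximisation idea above can be illustrated with a minimal sketch using scikit-learn (an assumed choice of library; the data and parameters are illustrative): for a linear SVM with weight vector $w$, the margin width is $2/\|w\|$.

```python
# Illustrative sketch (assumed setup): fit a linear SVC on two well-separated
# Gaussian blobs and compute the margin 2/||w||, which the SVM maximises.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),   # class -1 blob
               rng.normal(2, 0.5, (20, 2))])   # class +1 blob
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)  # distance between the two supporting hyperplanes
```

Only the points closest to the boundary end up as `clf.support_vectors_`; moving any other point slightly does not change the fitted hyperplane.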
There are different types of support vector machines. For classification we will introduce C-SVC and Nu-SVC; for regression we will introduce SVR.
The concept described above is captured in the following primal and dual optimisation problems:
Consider a set of training points $x_i \in \mathbb{R}^p$, $i=1,...,n$, in two classes and a label vector $y \in \{-1,1\}^n$. Then the primal optimisation problem is formulated as:
$$ \min_{w,b,\zeta}\left(\frac{1}{2}w^Tw + C \sum_{i=1}^{n}\zeta_i\right)$$subject to $\hspace{1cm}$ $y_i(w^T\phi(x_i)+b)\geq 1-\zeta_i$, $\hspace{1cm}$ $\zeta_i \geq 0$, $\hspace{1cm}$ $i=1,...,n$
The corresponding dual problem is then:
$$ \min_{\alpha}\left(\frac{1}{2}\alpha^TQ\alpha-e^T\alpha\right)$$subject to $\hspace{1cm}$ $y^T\alpha=0$, $\hspace{1cm}$ $0 \leq \alpha_i \leq C$, $\hspace{1cm}$ $i=1,...,n$
where $e$ is the vector of all ones, $C>0$ is the upper bound, and $Q$ is an $n \times n$ positive semidefinite matrix with $Q_{ij} \equiv y_iy_jK(x_i,x_j)$, where $K(x_i,x_j)=\phi(x_i)^T\phi(x_j)$ is the kernel. The decision function is $\operatorname{sgn}\left(\sum_{i=1}^ny_i\alpha_iK(x_i,x)+\rho\right)$.
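The dual decision function can be checked numerically with a minimal sketch, assuming scikit-learn's `SVC` (whose `dual_coef_` attribute stores the products $y_i\alpha_i$ for the support vectors, and whose `intercept_` plays the role of $\rho$):

```python
# Sketch (assumed library: scikit-learn): reconstruct the decision values
# sum_i y_i * alpha_i * K(x_i, x) + rho by hand and compare them with
# SVC.decision_function on the same test points.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

X_test = rng.normal(size=(5, 3))
K = rbf_kernel(clf.support_vectors_, X_test, gamma=0.5)  # K(x_i, x) over support vectors
manual = (clf.dual_coef_ @ K + clf.intercept_).ravel()   # y_i*alpha_i weighted kernel sum + rho
```

Only support vectors (points with $\alpha_i > 0$) contribute to the sum, which is why the method is memory efficient.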
We introduce a new parameter $\nu \in (0,1]$ which acts as an upper bound on the fraction of training errors and as a lower bound on the fraction of support vectors. It can therefore be used to control the number of support vectors and the number of training errors. The formulation of Nu-SVC is equivalent to the formulation of C-SVC (for further reading I recommend https://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf).
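The role of $\nu$ as a lower bound on the support vector fraction can be observed directly, in a sketch assuming scikit-learn's `NuSVC` (data and parameter values are illustrative):

```python
# Sketch (assumed library: scikit-learn): with nu = 0.3, at least roughly
# 30% of the training points should end up as support vectors.
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, 1, -1)

nu = 0.3
clf = NuSVC(nu=nu, kernel="rbf").fit(X, y)
sv_fraction = len(clf.support_) / len(X)  # bounded below by nu
```

Increasing `nu` forces more points to become support vectors, trading a wider margin against more (bounded) margin violations.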
Consider a set of training points $x_i \in \mathbb{R}^p$, $i=1,...,n$, and a target vector $y \in \mathbb{R}^n$. Given parameters $C > 0$ and $\epsilon >0$, the standard form of the primal problem for support vector regression is
$$ \min_{w,b,\zeta,\zeta^*}\left(\frac{1}{2} w^T w + C\sum_{i=1}^n(\zeta_i+\zeta_i^*)\right)$$subject to $\hspace{1cm}$ $y_i-w^T\phi(x_i)-b \leq \epsilon+\zeta_i$, $\hspace{1cm}$ $w^T\phi(x_i)+b-y_i \leq \epsilon + \zeta_i^*$ , $\hspace{1cm}$ $\zeta_i,\zeta_i^*\geq 0$,$\hspace{1cm}$ $i=1,...,n$.
The corresponding dual problem is:
$$\min_{\alpha,\alpha^*}\left(\frac{1}{2}(\alpha-\alpha^*)^TQ(\alpha-\alpha^*)+\epsilon e^T(\alpha+\alpha^*)-y^T(\alpha-\alpha^*)\right)$$subject to $\hspace{1cm}$ $e^T(\alpha - \alpha^*)=0$, $\hspace{1cm}$ $0 \leq \alpha_i, \alpha_i^*\leq C$, $\hspace{1cm}$ $i=1,...,n$
where $e$ is the vector of all ones, $C>0$ is the upper bound, and $Q$ is an $n \times n$ positive semidefinite matrix with $Q_{ij} \equiv K(x_i,x_j)$, where $K(x_i,x_j)=\phi(x_i)^T\phi(x_j)$ is the kernel. The prediction is $\sum_{i=1}^n(\alpha_i-\alpha_i^*)K(x_i,x)+\rho$.
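A minimal regression sketch, assuming scikit-learn's `SVR` (data and hyperparameters are illustrative): the $\epsilon$-tube means residuals smaller than $\epsilon$ incur no loss, so such points need not become support vectors.

```python
# Sketch (assumed library: scikit-learn): epsilon-SVR fitted to a noisy
# sine curve; the tube half-width epsilon controls which points incur loss.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(-3, 3, (80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.05, 80)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
pred = reg.predict(X)
```

Shrinking `epsilon` narrows the tube, which typically increases both the number of support vectors and the fit accuracy.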
As the mathematical formulation shows, SVMs traditionally separate binary classes. To apply these methods to a multiclass problem we need additional strategies. We will present two basic approaches here. For further reading I recommend http://www.springer.com/cda/content/document/cda_downloaddocument/9783319022994-c1.pdf?SGWID=0-0-45-1446422-p175468473.
In this one-vs-one approach, all possible pairwise classifiers are evaluated. Assuming we have $n$ classes, we get $\binom n2= \frac{n(n-1)}{2}$ classifiers. Given a test sample, each classifier is applied to it and casts a vote for the winning class. After all classifiers have been applied, the class that received the most votes is returned as output. Even though many classifiers are trained, each underlying quadratic programming problem is relatively small, so training can be done fast.
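The pairwise count can be checked with a small sketch, assuming scikit-learn's `SVC` (which implements this one-vs-one scheme internally; the dataset below is synthetic and illustrative):

```python
# Sketch (assumed library: scikit-learn): with n classes, the 'ovo'
# decision function exposes one column per pairwise classifier,
# i.e. n*(n-1)/2 columns.
from sklearn.svm import SVC
from sklearn.datasets import make_classification

n_classes = 4
X, y = make_classification(n_samples=120, n_features=8, n_informative=4,
                           n_classes=n_classes, random_state=0)
clf = SVC(decision_function_shape="ovo").fit(X, y)
n_pairwise = clf.decision_function(X).shape[1]  # binom(n, 2) pairwise classifiers
```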
In this one-vs-rest approach, for $n$ different classes $n$ binary classifiers are constructed. For classes $i=1,...,n$, the $i$-th classifier uses the elements of class $i$ as positive training examples and those of the remaining classes as negative training examples. At test time, the class label is determined by the classifier that returns the maximum output value. A disadvantage of this method is the forced skewness of the training data: if the original problem is symmetric, meaning it has the same number of instances in every class, each constructed subproblem will have a fraction $1/n$ of the examples in one class and $(n-1)/n$ in the other. This breaks the symmetry of the original problem.
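The one-vs-rest scheme just described can be sketched explicitly (a hand-rolled illustration using scikit-learn's `LinearSVC`; variable names and data are hypothetical, not from the text):

```python
# Sketch of one-vs-rest: train one binary SVM per class (class i vs. the
# rest), then predict the class whose classifier gives the largest output.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=150, n_features=8, n_informative=4,
                           n_classes=3, random_state=1)

classifiers = []
for cls in np.unique(y):
    binary_y = np.where(y == cls, 1, -1)      # class cls positive, rest negative
    classifiers.append(LinearSVC(max_iter=5000).fit(X, binary_y))

# maximum output value across the n classifiers decides the label
scores = np.column_stack([c.decision_function(X) for c in classifiers])
ovr_pred = scores.argmax(axis=1)
```

Note the skewness mentioned above: each `binary_y` here has roughly 50 positive and 100 negative examples even though the original three classes are balanced.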
The most widely used support vector machine packages in popular languages (Python, R, MATLAB) wrap LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/), a library for support vector machines written in C. It offers the following functionality:
Functionality (LIBSVM.jl) | Yes/No |
---|---|
Package working | Yes |
Different SVM formulations | Yes |
Efficient multi-class classification | Yes |
Cross validation for model selection | No |
Probability estimates | Yes |
Various kernels | Yes |
Weighted SVM for unbalanced data | Yes |
Automatic model selection | No |
A more detailed review can be found here:
https://github.com/dominusmi/warwick-rsg/blob/master/Scouting/LIBSVM.jl.ipynb
Functionality (SVM.jl) | Yes/No |
---|---|
Package working | No, will be removed |
Different SVM formulations | Yes (contains only two algorithms, only linear) |
Efficient multi-class classification | No |
Cross validation for model selection | No |
Probability estimates | No |
Various kernels | No |
Weighted SVM for unbalanced data | Yes |
Automatic model selection | No |
A more detailed review can be found here:
https://github.com/dominusmi/warwick-rsg/blob/master/Scouting/SVM.jl.ipynb
Functionality (KSVM.jl) | Yes/No |
---|---|
Package working | No |
Different SVM formulations | Yes (only linear SVMs) |
Efficient multi-class classification | Yes |
Cross validation for model selection | No |
Probability estimates | No |
Various kernels | Yes (only linear) |
Weighted SVM for unbalanced data | No |
Automatic model selection | No |
A more detailed review can be found here:
https://github.com/dominusmi/warwick-rsg/blob/master/Scouting/KSVM.jl.ipynb