The idea behind support vector machines (SVMs) is to construct an optimal separating hyperplane between two classes that may not be separable by a linear boundary in the original input space.
Advantages | Disadvantages |
---|---|
Effective in high-dimensional spaces | If #features >> #samples, overfitting must be avoided by a careful choice of kernel function and regularisation term |
Still effective when #dimensions > #samples | No probability estimates provided directly; obtaining them requires expensive cross-validation |
Memory efficient (uses only a subset of the training points, the support vectors) | - |
Versatile (different kernel functions can be specified) | - |
Considering that we want to separate classes using hyperplanes, the classification is less prone to generalisation errors if the distance between the hyperplane and the nearest data point of either class is large. The aim is therefore to maximise this margin.
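The margin-maximisation idea above can be illustrated with a minimal sketch using scikit-learn (an assumed choice of library; the data and parameters are illustrative): for a linear SVM with weight vector $w$, the margin width is $2/\|w\|$.

```python
# Illustrative sketch (assumed setup): fit a linear SVC on two well-separated
# Gaussian blobs and compute the margin 2/||w||, which the SVM maximises.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),   # class -1 blob
               rng.normal(2, 0.5, (20, 2))])   # class +1 blob
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)  # distance between the two supporting hyperplanes
```

Only the points closest to the boundary end up as `clf.support_vectors_`; moving any other point slightly does not change the fitted hyperplane.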
There are different types of support vector machines. For classification we will introduce C-SVC and Nu-SVC; for regression we will introduce SVR.
The concept described above is captured in the following primal and dual optimisation problems:
Consider a set of training points $x_i \in \mathbb{R}^p$, $i=1,...,n$, in two classes and a label vector $y \in \{-1,1\}^n$. Then the primal optimisation problem is formulated as:
$$ \min_{w,b,\zeta}\left(\frac{1}{2}w^Tw + C \sum_{i=1}^{n}\zeta_i\right)$$subject to $\hspace{1cm}$ $y_i(w^T\phi(x_i)+b)\geq 1-\zeta_i$, $\hspace{1cm}$ $\zeta_i \geq 0$, $\hspace{1cm}$ $i=1,...,n$
The corresponding dual problem is then:
$$ \min_{\alpha}\left(\frac{1}{2}\alpha^TQ\alpha-e^T\alpha\right)$$subject to $\hspace{1cm}$ $y^T\alpha=0$, $\hspace{1cm}$ $0 \leq \alpha_i \leq C$, $\hspace{1cm}$ $i=1,...,n$
where $e$ is the vector of all ones, $C>0$ is the upper bound, and $Q$ is an $n \times n$ positive semidefinite matrix with $Q_{ij} \equiv y_iy_jK(x_i,x_j)$, where $K(x_i,x_j)=\phi(x_i)^T\phi(x_j)$ is the kernel. The decision function is $\operatorname{sgn}\left(\sum_{i=1}^ny_i\alpha_iK(x_i,x)+\rho\right)$.
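The dual decision function can be checked numerically with a minimal sketch, assuming scikit-learn's `SVC` (whose `dual_coef_` attribute stores the products $y_i\alpha_i$ for the support vectors, and whose `intercept_` plays the role of $\rho$):

```python
# Sketch (assumed library: scikit-learn): reconstruct the decision values
# sum_i y_i * alpha_i * K(x_i, x) + rho by hand and compare them with
# SVC.decision_function on the same test points.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

X_test = rng.normal(size=(5, 3))
K = rbf_kernel(clf.support_vectors_, X_test, gamma=0.5)  # K(x_i, x) over support vectors
manual = (clf.dual_coef_ @ K + clf.intercept_).ravel()   # y_i*alpha_i weighted kernel sum + rho
```

Only support vectors (points with $\alpha_i > 0$) contribute to the sum, which is why the method is memory efficient.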
We introduce a new parameter $\nu \in (0,1]$ which acts as an upper bound on the fraction of training errors and as a lower bound on the fraction of support vectors. It can therefore be used to control the number of support vectors and the number of training errors. The formulation of Nu-SVC is equivalent to the formulation of C-SVC (for further reading I recommend https://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf).
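The role of $\nu$ as a lower bound on the support vector fraction can be observed directly, in a sketch assuming scikit-learn's `NuSVC` (data and parameter values are illustrative):

```python
# Sketch (assumed library: scikit-learn): with nu = 0.3, at least roughly
# 30% of the training points should end up as support vectors.
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, 1, -1)

nu = 0.3
clf = NuSVC(nu=nu, kernel="rbf").fit(X, y)
sv_fraction = len(clf.support_) / len(X)  # bounded below by nu
```

Increasing `nu` forces more points to become support vectors, trading a wider margin against more (bounded) margin violations.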
Consider a set of training points $x_i \in \mathbb{R}^p$, $i=1,...,n$, and a target vector $y \in \mathbb{R}^n$. Given parameters $C > 0$ and $\epsilon >0$, the standard form of the primal problem for support vector regression is
$$ \min_{w,b,\zeta,\zeta^*}\left(\frac{1}{2} w^T w + C\sum_{i=1}^n(\zeta_i+\zeta_i^*)\right)$$subject to $\hspace{1cm}$ $y_i-w^T\phi(x_i)-b \leq \epsilon+\zeta_i$, $\hspace{1cm}$ $w^T\phi(x_i)+b-y_i \leq \epsilon + \zeta_i^*$ , $\hspace{1cm}$ $\zeta_i,\zeta_i^*\geq 0$,$\hspace{1cm}$ $i=1,...,n$.
The corresponding dual problem is:
$$\min_{\alpha,\alpha^*}\left(\frac{1}{2}(\alpha-\alpha^*)^TQ(\alpha-\alpha^*)+\epsilon e^T(\alpha+\alpha^*)-y^T(\alpha-\alpha^*)\right)$$subject to $\hspace{1cm}$ $e^T(\alpha - \alpha^*)=0$, $\hspace{1cm}$ $0 \leq \alpha_i, \alpha_i^*\leq C$, $\hspace{1cm}$ $i=1,...,n$
where $e$ is the vector of all ones, $C>0$ is the upper bound, and $Q$ is an $n \times n$ positive semidefinite matrix with $Q_{ij} \equiv K(x_i,x_j)$, where $K(x_i,x_j)=\phi(x_i)^T\phi(x_j)$ is the kernel. The prediction is $\sum_{i=1}^n(\alpha_i-\alpha_i^*)K(x_i,x)+\rho$.
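A minimal regression sketch, assuming scikit-learn's `SVR` (data and hyperparameters are illustrative): the $\epsilon$-tube means residuals smaller than $\epsilon$ incur no loss, so such points need not become support vectors.

```python
# Sketch (assumed library: scikit-learn): epsilon-SVR fitted to a noisy
# sine curve; the tube half-width epsilon controls which points incur loss.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(-3, 3, (80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.05, 80)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
pred = reg.predict(X)
```

Shrinking `epsilon` narrows the tube, which typically increases both the number of support vectors and the fit accuracy.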
As the mathematical formulation shows, SVMs traditionally separate binary classes. To apply these methods to a multiclass problem we need additional strategies. We will present two basic approaches here. For further reading I recommend http://www.springer.com/cda/content/document/cda_downloaddocument/9783319022994-c1.pdf?SGWID=0-0-45-1446422-p175468473.
In this one-vs-one approach, all possible pairwise classifiers are evaluated. Assuming we have $n$ classes, we get $\binom n2= \frac{n(n-1)}{2}$ classifiers. Given a test sample, each classifier is applied to it and casts a vote for the winning class. After all classifiers have been applied, the class that received the most votes is returned as output. Even though many classifiers are trained, each underlying quadratic programming problem is relatively small, so training can be done fast.
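The pairwise count can be checked with a small sketch, assuming scikit-learn's `SVC` (which implements this one-vs-one scheme internally; the dataset below is synthetic and illustrative):

```python
# Sketch (assumed library: scikit-learn): with n classes, the 'ovo'
# decision function exposes one column per pairwise classifier,
# i.e. n*(n-1)/2 columns.
from sklearn.svm import SVC
from sklearn.datasets import make_classification

n_classes = 4
X, y = make_classification(n_samples=120, n_features=8, n_informative=4,
                           n_classes=n_classes, random_state=0)
clf = SVC(decision_function_shape="ovo").fit(X, y)
n_pairwise = clf.decision_function(X).shape[1]  # binom(n, 2) pairwise classifiers
```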
In this one-vs-rest approach, for $n$ different classes $n$ binary classifiers are constructed. For classes $i=1,...,n$, the $i$-th classifier uses the elements of class $i$ as positive training examples and those of the remaining classes as negative training examples. At test time, the class label is determined by the classifier that returns the maximum output value. A disadvantage of this method is the forced skewness of the training data: if the original problem is symmetric, meaning it has the same number of instances in every class, each constructed subproblem will have a fraction $1/n$ of the examples in one class and $(n-1)/n$ in the other. This breaks the symmetry of the original problem.
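The one-vs-rest scheme just described can be sketched explicitly (a hand-rolled illustration using scikit-learn's `LinearSVC`; variable names and data are hypothetical, not from the text):

```python
# Sketch of one-vs-rest: train one binary SVM per class (class i vs. the
# rest), then predict the class whose classifier gives the largest output.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=150, n_features=8, n_informative=4,
                           n_classes=3, random_state=1)

classifiers = []
for cls in np.unique(y):
    binary_y = np.where(y == cls, 1, -1)      # class cls positive, rest negative
    classifiers.append(LinearSVC(max_iter=5000).fit(X, binary_y))

# maximum output value across the n classifiers decides the label
scores = np.column_stack([c.decision_function(X) for c in classifiers])
ovr_pred = scores.argmax(axis=1)
```

Note the skewness mentioned above: each `binary_y` here has roughly 50 positive and 100 negative examples even though the original three classes are balanced.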
The most widely used support vector machine packages in popular languages (Python, R, MATLAB) wrap LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/), a library for support vector machines written in C. It offers the following functionality:
Functionality (LIBSVM.jl) | Yes/No |
---|---|
Package working | Yes |
Different SVM formulations | Yes |
Efficient multi-class classification | Yes |
Cross validation for model selection | No |
Probability estimates | Yes |
Various kernels | Yes |
Weighted SVM for unbalanced data | Yes |
Automatic model selection | No |
A more detailed review can be found here:
https://github.com/dominusmi/warwick-rsg/blob/master/Scouting/LIBSVM.jl.ipynb
Functionality (SVM.jl) | Yes/No |
---|---|
Package working | No, will be removed |
Different SVM formulations | Yes (contains only two algorithms, only linear) |
Efficient multi-class classification | No |
Cross validation for model selection | No |
Probability estimates | No |
Various kernels | No |
Weighted SVM for unbalanced data | Yes |
Automatic model selection | No |
A more detailed review can be found here:
https://github.com/dominusmi/warwick-rsg/blob/master/Scouting/SVM.jl.ipynb
Functionality (KSVM.jl) | Yes/No |
---|---|
Package working | No |
Different SVM formulations | Yes (only linear SVMs) |
Efficient multi-class classification | Yes |
Cross validation for model selection | No |
Probability estimates | No |
Various kernels | Yes (only linear) |
Weighted SVM for unbalanced data | No |
Automatic model selection | No |
A more detailed review can be found here:
https://github.com/dominusmi/warwick-rsg/blob/master/Scouting/KSVM.jl.ipynb