Question

This question is in response to a comment I saw on another question.

The comment was regarding the Machine Learning course syllabus on Coursera, and was along the lines of "SVMs are not used so much nowadays".

I have only just finished the relevant lectures myself, and my understanding of SVMs is that they are a robust and efficient learning algorithm for classification, and that when using a kernel they have a "niche" covering perhaps 10 to 1,000 features and perhaps 100 to 10,000 training samples. The limit on training samples arises because the core algorithm revolves around optimising over a square (kernel) matrix whose dimensions are based on the number of training samples, not the number of original features.
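To make that concrete, here is a minimal sketch (the library choice and sizes are mine, purely for illustration) showing that the kernel (Gram) matrix grows with the number of samples, not the number of features:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.rand(5000, 20)   # 5,000 samples, 20 features
K = rbf_kernel(X)              # pairwise Gaussian kernel values
print(K.shape)                 # (5000, 5000): grows with samples, not features
```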

So does the comment I saw refer to some real change since the course was made, and if so, what is that change: a new algorithm that covers SVM's "sweet spot" just as well, or better CPUs meaning SVM's computational advantages are not worth as much? Or is it perhaps the opinion or personal experience of the commenter?

I tried a search for e.g. "are support vector machines out of fashion" and found nothing to imply they were being dropped in favour of anything else.

And Wikipedia has this: http://en.wikipedia.org/wiki/Support_vector_machine#Issues. The main sticking point there appears to be the difficulty of interpreting the model, which makes SVM fine as a black-box prediction engine but not so good for generating insights. I don't see that as a major issue, just another minor thing to take into account when picking the right tool for the job (along with the nature of the training data, the learning task, etc.).


Solution

SVM is a powerful classifier. It has some nice advantages, which I guess were responsible for its popularity:

  • Efficiency: Only the support vectors play a role in determining the classification boundary. All other points from the training set needn't be stored in memory.
  • The so-called power of kernels: With an appropriate kernel you can transform the feature space into a higher dimension so that it becomes linearly separable. The notion of kernels works with arbitrary objects on which you can define some notion of similarity with the help of inner products, and hence SVMs can classify arbitrary objects such as trees, graphs, etc.

There are some significant disadvantages as well.

  • Parameter sensitivity: Performance is highly sensitive to the choice of the regularization parameter C, which controls how much slack (variance) the model is allowed.
  • Extra parameter for the Gaussian kernel: The width (radius) of the Gaussian kernel can have a significant impact on classifier accuracy. Typically a grid search over both parameters has to be conducted to find a good combination (a sketch follows this list); LibSVM ships with support for such a grid search.
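As a sketch of that tuning process, here is how a joint grid search over C and the Gaussian-kernel width might look with scikit-learn (the dataset and grid values are illustrative, not recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Search jointly over the regularization parameter C and the RBF width gamma.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```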

SVMs generally belong to the class of "sparse kernel machines". The sparse vectors in the case of SVM are the support vectors, which are chosen via the maximum-margin criterion. Other sparse kernel machines, such as the Relevance Vector Machine (RVM), can perform better than SVM. The following figure shows a comparative performance of the two. The x-axis shows one-dimensional data from two classes y = {0, 1}. The mixture model is defined as P(x|y=0) = Unif(0, 1) and P(x|y=1) = Unif(0.5, 1.5) (Unif denotes the uniform distribution). 1000 points were sampled from this mixture, and an SVM and an RVM were used to estimate the posterior. The problem with the SVM is that its predicted values are far off from the true log odds.

[Figure: RVM vs. SVM posterior estimates]
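That setup is easy to reproduce. A minimal sketch, assuming equal class priors and using scikit-learn's Platt-scaled probabilities as the SVM's posterior estimate (the RVM is omitted, since it has no standard scikit-learn implementation):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)
# Class-conditional densities from the text:
# P(x|y=0) = Unif(0, 1), P(x|y=1) = Unif(0.5, 1.5)
x = np.where(y == 0, rng.uniform(0, 1, n), rng.uniform(0.5, 1.5, n))

# Platt-scaled SVM probabilities serve as the posterior estimate.
clf = SVC(probability=True).fit(x.reshape(-1, 1), y)

# In the overlap region (0.5, 1) both densities equal 1, so with equal
# priors the true log odds are 0. Compare the SVM's estimates against that.
grid = np.linspace(0.6, 0.9, 4).reshape(-1, 1)
p = clf.predict_proba(grid)[:, 1]
print(np.log(p / (1 - p)))
```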

A very effective classifier, which is very popular nowadays, is the Random Forest. Its main advantages (a usage sketch follows the list) are:

  • Only one parameter to tune (i.e. the number of trees in the forest)
  • Not overly sensitive to its parameters
  • Can easily be extended to multiple classes
  • Is based on probabilistic principles (maximizing mutual information gain with the help of decision trees)
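A minimal usage sketch with scikit-learn (the dataset and tree count are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# n_estimators (the number of trees) is the main knob to turn;
# multiclass problems are handled natively.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```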

OTHER TIPS

In answering this question one significant distinction to make is whether we are talking about linear Support Vector Machines or non-linear, that is, kernelized Support Vector Machines.

Linear SVMs

Linear SVMs are, both in theory and in practice, very good models when your data can be explained by linear relations between your features. They are superior to classic methods such as linear (least-squares) regression because they are robust, in the sense that small perturbations in the input data do not produce significant changes in the model. This is achieved by trying to find the line (hyperplane) that maximizes the margins between your data points. This maximum-margin hyperplane has been shown to give guarantees on the generalization ability of the model over unseen data points, a theoretical property that many other machine learning methods lack.

Linear SVMs are also as interpretable as any other linear model, since each input feature has a weight that directly influences the model output.
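For instance, with scikit-learn's LinearSVC the learned weights are directly available after fitting (a sketch; the dataset is just for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LinearSVC()).fit(X, y)

# One weight per input feature: its sign and magnitude show how each
# feature pushes the decision, just as in any other linear model.
print(model.named_steps["linearsvc"].coef_.shape)  # (1, 30)
```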

Linear SVMs are also very fast to train, showing sublinear training times for very large datasets. This is achieved by making use of stochastic gradient descent techniques, much in the fashion of current deep learning methods.
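In scikit-learn that regime corresponds to SGDClassifier with the hinge loss, which optimizes a linear SVM objective one sample at a time; a minimal sketch on a synthetic problem (sizes chosen arbitrarily):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# A larger synthetic problem, where batch kernel solvers start to struggle.
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

# loss="hinge" makes this a linear SVM trained by stochastic gradient descent.
clf = SGDClassifier(loss="hinge", alpha=1e-4).fit(X, y)
print(clf.score(X, y))
```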

Non-linear SVMs

Non-linear SVMs are still linear models and enjoy the same theoretical benefits, but they employ the so-called kernel trick to build that linear model over an enlarged space. The visible result is that the resulting model can make non-linear decisions on your data. Since you can provide a custom kernel encoding similarities between data points, you can use problem knowledge to make the kernel focus on the relevant parts of your problem. Doing this effectively can be difficult, however, so in general almost everybody uses the plug-and-play Gaussian kernel.
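scikit-learn's SVC accepts a callable as the kernel, so problem-specific similarity functions can be plugged in directly. A sketch, with a hand-rolled Gaussian kernel standing in for a genuinely custom one:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

def my_kernel(A, B, gamma=0.5):
    # Gaussian (RBF) kernel written out by hand; a real custom kernel would
    # encode whatever domain-specific similarity fits the problem.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel=my_kernel).fit(X, y)
print(clf.score(X, y))
```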

Non-linear SVMs are partially interpretable, as they tell you which training data are relevant for prediction, and which aren't. This is not possible for other methods such as Random Forests or Deep Networks.
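After fitting, the chosen support vectors can be read off the model, e.g. in scikit-learn (dataset again just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="rbf").fit(X, y)

# Indices of the training points the decision actually depends on;
# everything else could be discarded without changing the model.
print(clf.support_[:10])            # indices into the training set
print(clf.support_vectors_.shape)   # the retained points themselves
```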

Unfortunately, non-linear SVMs are slow. The state-of-the-art training algorithm is Sequential Minimal Optimization (SMO), whose running time scales roughly quadratically with the number of training samples; it is widely available through the LIBSVM library, which a number of machine learning libraries, scikit-learn included, wrap.
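That scaling is easy to observe empirically; a rough sketch (absolute timings are machine-dependent, the trend is the point):

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC  # wraps LIBSVM's SMO solver

# In the quadratic regime, doubling the training set roughly
# quadruples the fit time.
for n in (2000, 4000, 8000):
    X, y = make_classification(n_samples=n, n_features=20, random_state=0)
    t0 = time.perf_counter()
    SVC(kernel="rbf").fit(X, y)
    print(n, round(time.perf_counter() - t0, 2), "s")
```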

Popularity of these methods

It is true that SVMs are not as popular as they used to be: this can be checked by googling for research papers or implementations of SVMs vs. Random Forests or Deep Learning methods. Still, they are useful in some practical settings, especially in the linear case.

Also, bear in mind that, due to the no-free-lunch theorem, no machine learning method can be shown to be superior to any other over all problems. While some methods do work better in general, you will always find datasets where a less common method achieves better results.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange