Question

When would one use Random Forest over SVM and vice versa?

I understand that cross-validation and model comparison are an important aspect of choosing a model, but here I would like to learn more about rules of thumb and heuristics for the two methods.

Can someone please explain the subtleties, strengths, and weaknesses of the classifiers, as well as the problems that are best suited to each of them?


Solution

I would say the choice depends very much on what data you have and what your purpose is. A few "rules of thumb":

Random Forest is intrinsically suited for multiclass problems, while SVM is intrinsically two-class. For a multiclass problem, SVM requires reducing it into multiple binary classification problems.
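As a sketch of that difference (using scikit-learn, purely for illustration): the SVM has to be wrapped in an explicit one-vs-rest reduction, while the forest takes the three classes as they are.

```python
# Sketch: scikit-learn reduces a multiclass problem to binary SVMs
# explicitly via OneVsRestClassifier; Random Forest needs no such wrapper.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

# One binary SVM per class, trained "this class vs. the rest".
ovr_svm = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
print(len(ovr_svm.estimators_))  # one underlying SVM per class

# Random Forest handles the 3 classes natively, no reduction needed.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```

(Modern SVM libraries hide this reduction behind a one-vs-one or one-vs-rest strategy, but it is still happening under the hood.)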

Random Forest works well with a mixture of numerical and categorical features. It is also fine when features are on various scales. Roughly speaking, with Random Forest you can use the data as they are. SVM maximizes the "margin" and thus relies on a notion of "distance" between points; it is up to you to decide whether that distance is meaningful. As a consequence, one-hot encoding of categorical features is a must, and min-max or other scaling is highly recommended as a preprocessing step.
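A minimal sketch of that preprocessing (scikit-learn names, toy data invented for illustration): the categorical column is one-hot encoded and the numeric columns are rescaled before the SVM ever sees them.

```python
# Sketch: the preprocessing an SVM needs but a Random Forest does not --
# one-hot encode the categorical column, min-max scale the numeric ones.
# (Data and column layout are made up for illustration.)
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import SVC

# Toy data: two numeric features on very different scales + one
# integer-coded categorical feature in the last column.
X = np.array([[1.0, 5000.0, 0.0],
              [2.0, 9000.0, 1.0],
              [1.5, 7000.0, 2.0],
              [3.0, 4000.0, 0.0]])
y = [0, 1, 1, 0]

pre = ColumnTransformer([
    ("num", MinMaxScaler(), [0, 1]),   # rescale numeric columns to [0, 1]
    ("cat", OneHotEncoder(), [2]),     # one-hot the categorical column
])
svm_clf = make_pipeline(pre, SVC(kernel="rbf")).fit(X, y)
pred = svm_clf.predict(X)
```

A Random Forest could be fitted on `X` directly with no transformer at all, which is the "use data as they are" point above.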

If you have data with $n$ points and $m$ features, an intermediate step in SVM is constructing an $n\times n$ matrix (think about the memory required to store it) by calculating $n^2$ dot products (computational complexity). Therefore, as a rule of thumb, SVM is hardly scalable beyond roughly $10^5$ points. A large number of features (homogeneous features with a meaningful distance; the pixels of an image would be a perfect example) is generally not a problem.
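A quick back-of-the-envelope check of that rule of thumb, assuming a dense kernel matrix of float64 entries (8 bytes each):

```python
# Memory cost of the dense n x n kernel matrix alone: 8 * n^2 bytes.
def kernel_matrix_gib(n: int, bytes_per_entry: int = 8) -> float:
    """Memory for a dense n x n float64 kernel matrix, in GiB."""
    return bytes_per_entry * n * n / 2**30

print(round(kernel_matrix_gib(10_000), 2))   # ~0.75 GiB: manageable
print(round(kernel_matrix_gib(100_000), 1))  # ~74.5 GiB: hardly feasible
```

So moving from $10^4$ to $10^5$ points multiplies the storage by 100, which is exactly why the $10^5$ boundary shows up in practice.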

For a classification problem, Random Forest gives you the probability of belonging to a class. SVM gives you the distance to the boundary; if you need a probability, you still have to convert that distance into one somehow.
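The standard conversion is Platt scaling: fit a sigmoid on the decision values. A toy sketch with made-up sigmoid coefficients (in practice `A` and `B` are fit on held-out decision values, e.g. what scikit-learn does under `SVC(probability=True)`):

```python
# Sketch of Platt scaling: squash a signed distance to the SVM boundary
# into a pseudo-probability with a sigmoid. A and B are illustrative only.
import math

def platt_probability(decision_value: float,
                      A: float = -1.0, B: float = 0.0) -> float:
    """Map a signed distance to the boundary into (0, 1)."""
    return 1.0 / (1.0 + math.exp(A * decision_value + B))

print(platt_probability(0.0))   # on the boundary -> 0.5
print(platt_probability(2.5))   # far on the positive side -> close to 1
```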

For those problems where SVM applies, it generally performs better than Random Forest.

SVM gives you "support vectors", that is, the points in each class closest to the boundary between classes. They may be of interest in themselves for interpretation.

Other tips

SVM models generally perform better on sparse data than trees do. For example, in document classification you may have thousands, even tens of thousands of features, and in any given document vector only a small fraction of these features may have a value greater than zero. There are probably other differences between them, but this is what I found for my problems.
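A small sketch of that setting (scikit-learn, with an invented toy corpus): the TF-IDF matrix is stored sparse, and a linear SVM trains directly on it.

```python
# Sketch: document classification on a sparse bag-of-words matrix,
# where a linear SVM is a common strong baseline. Toy corpus and labels
# are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["cheap pills buy now",
        "meeting agenda attached",
        "buy cheap now",
        "agenda for the next meeting"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

X = TfidfVectorizer().fit_transform(docs)  # scipy sparse matrix
print(X.nnz, "non-zeros out of", X.shape[0] * X.shape[1])

clf = LinearSVC().fit(X, labels)
```

On a real corpus the sparsity is far more extreme than in this toy example, which is where the advantage over trees shows up.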

It really depends on what you want to achieve, what your data look like, and so on. SVM will generally perform better on linear dependencies; otherwise you need a nonlinear kernel, and the choice of kernel may change the results. SVMs are also less interpretable: if, for example, you want to explain why a sample was classified the way it was, that will be non-trivial. Decision trees have better interpretability, they work faster, and they handle categorical/numerical variables just fine; moreover, non-linear dependencies are handled well (given $N$ large enough). They also generally train faster than SVMs, but they have a tendency to overfit...
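One concrete piece of that interpretability gap, sketched with scikit-learn: a fitted forest reports per-feature importances out of the box, whereas a kernel SVM offers nothing directly comparable.

```python
# Sketch: a fitted Random Forest exposes per-feature importances,
# one readable number per input feature.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(data.data, data.target)

# Importances are non-negative and sum to 1 across features.
for name, imp in zip(data.feature_names, rf.feature_importances_):
    print(f"{name}: {imp:.2f}")
```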

I would also try Logistic Regression - a great interpretable classifier.
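For instance (a scikit-learn sketch): after fitting, logistic regression exposes one signed coefficient per feature, each readable as a log-odds contribution.

```python
# Sketch: logistic regression's interpretability -- one signed
# coefficient per feature after fitting (features scaled so the
# coefficient magnitudes are comparable).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=1000)).fit(X, y)

coefs = clf.named_steps["logisticregression"].coef_[0]
print(coefs.shape)  # one coefficient per input feature
```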

To sum it up: the rule of thumb is to try everything and compare what gives you the best results/interpretation.

To complement the good points already stated:

According to "Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?", random forests are more likely to achieve better performance than SVMs.

Besides, given the way the algorithms are implemented (and for theoretical reasons), random forests are usually much faster than (non-linear) SVMs. Indeed, as @Ianenok said, SVMs tend to be unusable beyond 10,000 data points.

However, SVMs are known to perform better on some specific datasets (images, microarray data...).

So, once again, cross-validation is indeed the best way to know which method performs best.

Source : Random forest vs SVM

Licensed under: CC-BY-SA with attribution