Question

I have to write a classifier that separates a corpus of texts into 2 classes. The corpus is very large (about 4 million documents to classify, and 50,000 labeled documents for training). Which algorithm should I choose?

  • Naive Bayes
  • Neural networks
  • SVM
  • Random forest
  • kNN (why not?)

I have heard that random forests and SVMs are state-of-the-art methods, but maybe someone has experience with the algorithms listed above and knows which is fastest and which is most accurate?


Solution

For a two-class text classifier, I don't think you need:

(1) kNN: it is an instance-based (lazy) method that must compare every test document against the entire training set at prediction time, so classifying 4 million documents would be very slow;

(2) Random forest: decision-tree ensembles tend to be a poor fit for the high-dimensional, sparse feature spaces typical of text.

You can try:

(1) Naive Bayes: the most straightforward and easiest to code, and proven to work well on text classification problems (a minimal sketch follows this list);

(2) Logistic regression: works well when the number of training samples is much larger than the number of features;

(3) SVM: again, when training samples greatly outnumber features, an SVM with a linear kernel works about as well as logistic regression, and it is also one of the top algorithms for text classification;

(4) Neural networks: they seem like a panacea in machine learning, and in theory they can learn any model that an SVM or logistic regression could. The problem is that there are fewer mature packages for neural networks than for SVMs, so building and tuning one tends to be time-consuming.
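
As a concrete starting point, here is a minimal sketch of option (1) using scikit-learn (recommended below). The tiny `train_texts`/`train_labels` arrays are hypothetical stand-ins for your 50,000 labeled training documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the real corpus: four tiny documents, two classes.
train_texts = ["cheap pills online now", "buy cheap pills",
               "meeting moved to noon", "see you at the meeting"]
train_labels = [1, 1, 0, 0]

# TF-IDF features feeding multinomial Naive Bayes; both handle the sparse,
# high-dimensional matrices typical of text efficiently.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["cheap meeting pills"]))  # predicted class depends on the toy data
```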

Still, it is hard to say which algorithm is best suited to your case. If you are using Python, scikit-learn includes almost all of these algorithms for you to test. Weka, which integrates many machine learning algorithms behind a user-friendly graphical interface, is also a good way to compare the performance of each algorithm. A sketch of such a comparison with scikit-learn follows.
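
This is a hedged sketch of the "test them all" advice, not a definitive benchmark: it compares three of the listed algorithms via scikit-learn's uniform estimator API, using the bundled 20 newsgroups data (restricted to two categories) as a stand-in for your own corpus. The choice of categories and estimators is purely illustrative.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Two-class subset of 20 newsgroups as a placeholder corpus;
# swap in your own list of texts and labels here.
data = fetch_20newsgroups(subset="train",
                          categories=["sci.med", "rec.autos"],
                          remove=("headers", "footers", "quotes"))

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
}

# Same TF-IDF preprocessing for each candidate, scored by 5-fold CV.
for name, estimator in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(), estimator)
    scores = cross_val_score(pipeline, data.data, data.target, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Because all three share the same `fit`/`predict` interface, adding or removing candidates is a one-line change, which is exactly why trying several algorithms on a held-out sample of your corpus is cheap before committing to one.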
