Question

I have a data set where each row represents an ad/banner impression. Some impressions belong to the +1 class, meaning the user clicked on the ad after the impression; the others belong to the -1 class. The data set contains 1% +1 rows and 99% -1 rows, about 6 million rows in total.

I've made 2 experiments:

  1. When I split the data set into two parts of equal size, I get 99.95% total accuracy, but 0% accuracy on the +1 class.
  2. When I take half of all +1 rows for the training set and append the same number of -1 rows (so the training set is 50% +1 and 50% -1), and put the remaining +1 rows together with another portion of -1 rows into the test set, I get 95% accuracy. But when I apply the trained model to the full distribution (99% -1 and 1% +1 rows), I get only 3% accuracy, which is not enough for production use.

Could you please advise how many rows of each class I should put into the training set, how large the training set should be in total, and how to train the model properly in my case?


Solution

This setup is called imbalanced data, and there are a variety of techniques commonly used to handle it. Many important problems in computer science look like this: a search engine has millions of documents and only a handful are relevant to a given query; a face detector must produce millions of no-detections on regions that contain no face (natural scenes and so on). Several things can be done.

First, you need to change how you measure accuracy. As you already saw, you can get ~99% total accuracy simply by labeling every data point as negative, yet that classifier is completely useless from a predictive standpoint. Look at per-class metrics (precision and recall on the +1 class) instead of overall accuracy.
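The point can be seen with a tiny hand-made example (hypothetical labels mimicking the 99/1 split; no real data involved):

```python
# Sketch: why overall accuracy misleads on imbalanced data.
# 990 negatives (-1) and 10 positives (+1), like a 99%/1% split.
y_true = [-1] * 990 + [1] * 10
y_pred = [-1] * 1000  # a "classifier" that always predicts "no click"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
positive_recall = true_positives / sum(t == 1 for t in y_true)

print(accuracy)         # 0.99 -- looks great
print(positive_recall)  # 0.0  -- useless for finding clicks
```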

One commonly used technique is to build an ROC curve or a precision-recall curve to determine a reasonable operating point (decision threshold) for your classifier.
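As a sketch of how such a curve is used, here is a precision-recall sweep over synthetic decision scores (the Gaussian scores stand in for real SVM margins; the precision floor of 0.5 is an arbitrary illustration, not a recommendation):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Fake decision scores: 990 negatives around -1, 10 positives around +1.
rng = np.random.default_rng(0)
y_true = np.r_[np.zeros(990), np.ones(10)]
scores = np.r_[rng.normal(-1, 1, 990), rng.normal(1, 1, 10)]

# Sweep every possible threshold instead of using the default cutoff of 0.
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Example operating-point rule: highest recall with precision >= 0.5.
ok = precision[:-1] >= 0.5
if ok.any():
    best = np.argmax(recall[:-1] * ok)
    print("chosen threshold:", thresholds[best])
```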

In many cases the objectives of the problem dictate different misclassification costs for each class, which LIBSVM fortunately supports through per-class weights. For example, if confusing a positive for a negative is 100 times more expensive than confusing a negative for a positive, you can pass `-w1 100 -w-1 1` when training the SVM.
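For illustration, scikit-learn's `SVC` (which wraps LIBSVM) exposes the same weights through `class_weight`; the toy data and the 100:1 weight below are just placeholders for your own costs:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: 200 negatives around (-1, -1), only 4 positives around (1, 1).
rng = np.random.default_rng(0)
X = np.r_[rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (4, 2))]
y = np.r_[-np.ones(200), np.ones(4)]

# class_weight here plays the role of LIBSVM's "-w1 100 -w-1 1":
# errors on the rare +1 class are penalized 100x more heavily.
clf = SVC(kernel="rbf", C=1.0, class_weight={1: 100, -1: 1})
clf.fit(X, y)

print("positive-class recall:", (clf.predict(X)[y == 1] == 1).mean())
```

Without the weight, such a lopsided sample often collapses to predicting -1 everywhere, which is exactly the failure mode from experiment 1.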

And of course, do not forget the importance of finding a good C (or a good C and gamma, if using an RBF kernel), typically via a cross-validated grid search.
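A minimal grid-search sketch, again on placeholder data; the grid values are arbitrary starting points, and the search is scored on F1 of the +1 class rather than raw accuracy, for the reasons above:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Toy imbalanced sample standing in for the real impression data.
rng = np.random.default_rng(1)
X = np.r_[rng.normal(-1, 1, (300, 2)), rng.normal(1, 1, (30, 2))]
y = np.r_[-np.ones(300), np.ones(30)]

grid = GridSearchCV(
    SVC(kernel="rbf", class_weight="balanced"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
    scoring="f1",                    # optimizes for the +1 class
    cv=StratifiedKFold(n_splits=3),  # preserves the class ratio per fold
)
grid.fit(X, y)
print(grid.best_params_)
```

Stratified folds matter here: with plain random folds, a split can end up with almost no positive examples to validate on.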

In general, it is not a matter of selecting a subset to train on; it is a matter of adjusting the training and evaluation procedure so that it works reasonably under your class distribution.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow