Question

I have a data set where each row represents an ad/banner impression. Some impressions belong to the +1 class, meaning the user clicked on the ad after the impression; the others belong to the -1 class. The data set contains 1% +1 rows and 99% -1 rows, about 6 million rows in total.

I've made 2 experiments:

  1. When I split the data set into two parts of equal size, I get 99.95% total accuracy, but 0% accuracy on the +1 class.
  2. When I take half of all +1 rows for the training set and append the same number of -1 rows (so the training set is 50% +1 and 50% -1), and put the remaining +1 rows together with another portion of -1 rows into the test set, I get 95% accuracy. But when I apply the trained model to the full distribution (99% -1 and 1% +1 rows), I get only 3% accuracy, which is not enough for production use.

Could you please advise how many rows of each class I should put into the training set, how large the training set should be in total, and how to train the model properly in my case?


Solution

This setup is called imbalanced data, and there are a variety of techniques commonly used to handle it. Many important problems in computer science look like this: a search engine has millions of documents and only a handful are relevant to a given query; a face detector must produce millions of no-detections on regions that contain no face (natural scenes and so on). Several things can be done.

First, you need to change how you measure accuracy. As you already saw, you can get ~99% total accuracy simply by labeling every data point as negative, yet that classifier is completely useless from a predictive standpoint. Look at per-class metrics (precision and recall on the +1 class) instead of overall accuracy.
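The point can be seen with a tiny hand-made example (hypothetical labels mimicking the 99/1 split; no real data involved):

```python
# Sketch: why overall accuracy misleads on imbalanced data.
# 990 negatives (-1) and 10 positives (+1), like a 99%/1% split.
y_true = [-1] * 990 + [1] * 10
y_pred = [-1] * 1000  # a "classifier" that always predicts "no click"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
positive_recall = true_positives / sum(t == 1 for t in y_true)

print(accuracy)         # 0.99 -- looks great
print(positive_recall)  # 0.0  -- useless for finding clicks
```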

One commonly used technique is to build an ROC curve or a precision-recall curve to determine a reasonable operating point (decision threshold) for your classifier.
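As a sketch of how such a curve is used, here is a precision-recall sweep over synthetic decision scores (the Gaussian scores stand in for real SVM margins; the precision floor of 0.5 is an arbitrary illustration, not a recommendation):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Fake decision scores: 990 negatives around -1, 10 positives around +1.
rng = np.random.default_rng(0)
y_true = np.r_[np.zeros(990), np.ones(10)]
scores = np.r_[rng.normal(-1, 1, 990), rng.normal(1, 1, 10)]

# Sweep every possible threshold instead of using the default cutoff of 0.
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Example operating-point rule: highest recall with precision >= 0.5.
ok = precision[:-1] >= 0.5
if ok.any():
    best = np.argmax(recall[:-1] * ok)
    print("chosen threshold:", thresholds[best])
```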

In many cases the objectives of the problem dictate different misclassification costs for each class, which LIBSVM fortunately supports through per-class weights. For example, if confusing a positive for a negative is 100 times more expensive than confusing a negative for a positive, you can pass `-w1 100 -w-1 1` when training the SVM.
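For illustration, scikit-learn's `SVC` (which wraps LIBSVM) exposes the same weights through `class_weight`; the toy data and the 100:1 weight below are just placeholders for your own costs:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: 200 negatives around (-1, -1), only 4 positives around (1, 1).
rng = np.random.default_rng(0)
X = np.r_[rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (4, 2))]
y = np.r_[-np.ones(200), np.ones(4)]

# class_weight here plays the role of LIBSVM's "-w1 100 -w-1 1":
# errors on the rare +1 class are penalized 100x more heavily.
clf = SVC(kernel="rbf", C=1.0, class_weight={1: 100, -1: 1})
clf.fit(X, y)

print("positive-class recall:", (clf.predict(X)[y == 1] == 1).mean())
```

Without the weight, such a lopsided sample often collapses to predicting -1 everywhere, which is exactly the failure mode from experiment 1.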

And of course, do not forget the importance of finding a good C (or a good C and gamma, if using an RBF kernel), typically via a cross-validated grid search.
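A minimal grid-search sketch, again on placeholder data; the grid values are arbitrary starting points, and the search is scored on F1 of the +1 class rather than raw accuracy, for the reasons above:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Toy imbalanced sample standing in for the real impression data.
rng = np.random.default_rng(1)
X = np.r_[rng.normal(-1, 1, (300, 2)), rng.normal(1, 1, (30, 2))]
y = np.r_[-np.ones(300), np.ones(30)]

grid = GridSearchCV(
    SVC(kernel="rbf", class_weight="balanced"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
    scoring="f1",                    # optimizes for the +1 class
    cv=StratifiedKFold(n_splits=3),  # preserves the class ratio per fold
)
grid.fit(X, y)
print(grid.best_params_)
```

Stratified folds matter here: with plain random folds, a split can end up with almost no positive examples to validate on.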

In general, it is not a matter of selecting a subset to train on; it is a matter of adjusting the training and evaluation procedure so that it works reasonably under your class distribution.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow