Question

I’m a beginner in machine learning and I’m facing a situation. I’m working on a Real Time Bidding problem, with the IPinYou dataset and I’m trying to do a click prediction.

The thing is that, as you may know, the dataset is very unbalanced: around 1,300 negative examples (non-click) for every 1 positive example (click).

This is what I do:

  1. Load the data
  2. Split the dataset into 3 datasets: A = training (60%), B = validation (20%), C = testing (20%)
  3. For each dataset (A, B, C), under-sample the negative class to get a ratio of 5 (5 negative examples for 1 positive example). This gives me 3 new, more balanced datasets: A', B', C'

Then I train my model on dataset A' using logistic regression.
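
A minimal sketch of that split + under-sampling + training pipeline with pandas and scikit-learn, assuming a feature DataFrame df with a binary 'click' target column (both names are placeholders for whatever the IPinYou loading step produces):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Assumed: df holds the features plus a binary 'click' target column.
    X, y = df.drop(columns="click"), df["click"]

    # 60/20/20 stratified split into A (train), B (validation), C (test).
    X_A, X_tmp, y_A, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
    X_B, X_C, y_B, y_C = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

    def undersample(X, y, ratio=5, seed=42):
        """Keep all positives and `ratio` times as many randomly drawn negatives."""
        pos = y[y == 1].index
        neg = y[y == 0].sample(n=ratio * len(pos), random_state=seed).index
        keep = pos.union(neg)
        return X.loc[keep], y.loc[keep]

    X_Ap, y_Ap = undersample(X_A, y_A)   # A'
    X_Bp, y_Bp = undersample(X_B, y_B)   # B'
    X_Cp, y_Cp = undersample(X_C, y_C)   # C'

    model = LogisticRegression(max_iter=1000).fit(X_Ap, y_Ap)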

My questions are:

  1. Which dataset should I use for validation? B or B'?

  2. Which dataset should I use for testing? C or C'?

  3. Which metrics are the most relevant to evaluate my model? The F1 score seems to be a widely used metric, but here, due to the unbalanced classes (if I use datasets B and C), the precision is low (under 0.20) and the F1 score is heavily affected by the low recall/precision. Would it be more accurate to use AUC-PR or AUC-ROC?

  4. If I want to plot the learning curve, which metric should I use? (knowing that the % error isn't relevant if I use the B' dataset for validation)

Thanks in advance for your time !

Regards.

Solution

Great question... Here are some specific answers to your numbered questions:

1) You should cross-validate on B, not B'. Otherwise, you won't know how well your class balancing is working. It couldn't hurt to cross-validate on both B and B', and doing so will be useful given the answer to 4 below.

2) You should test on both C and C', based on 4 below.

3) I would stick with F1, and it can also be useful to track ROC AUC as a sanity check. Both tend to be useful with unbalanced classes.
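
For example, a quick sketch of checking F1 and the AUC metrics on both the original and re-sampled validation sets (variable names follow the hypothetical split sketched in the question):

    from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

    # Evaluate on the original validation set B and the under-sampled B'.
    for name, (Xv, yv) in {"B": (X_B, y_B), "B'": (X_Bp, y_Bp)}.items():
        proba = model.predict_proba(Xv)[:, 1]     # predicted click probability
        print(name,
              "F1:", f1_score(yv, model.predict(Xv)),
              "ROC AUC:", roc_auc_score(yv, proba),
              "PR AUC:", average_precision_score(yv, proba))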

4) This gets really tricky. The problem is that the best method requires you either to reinterpret what the learning curves should look like or to use both the re-sampled and original data sets.

The classic interpretation of learning curves is:

  • Overfit - Lines don't quite come together;
  • Underfit - Lines come together but at too low an F1 score;
  • Just Right - Lines come together with a reasonable F1 score.
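
As an illustration, one way to produce F1-scored learning curves in scikit-learn (the variable names come from the hypothetical sketch in the question, and the cross-validation folds stand in for the validation set):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    # F1-scored learning curves; here computed on the under-sampled A'.
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X_Ap, y_Ap,
        scoring="f1", cv=5, train_sizes=np.linspace(0.1, 1.0, 8))

    plt.plot(sizes, train_scores.mean(axis=1), label="training F1")
    plt.plot(sizes, val_scores.mean(axis=1), label="cross-validation F1")
    plt.xlabel("training examples")
    plt.ylabel("F1")
    plt.legend()
    plt.show()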

Now, if you are training on A' and testing on C, the lines will never completely come together. If you are training on A' and testing on C', the results won't be meaningful in the context of the original problem. So what do you do?

The answer is to train on A' and test on B', but also test on B. Get the F1 score for B' where you want it to be, then check the F1 score for B. Then do your testing and generate learning curves for C. The curves won't ever come together, but you will have a sense of the acceptable bias... it's the difference between F1(B) and F1(B').
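
As a sketch (reusing the hypothetical variables from the question), that "acceptable bias" is just the gap between the two validation F1 scores:

    from sklearn.metrics import f1_score

    # The gap between the balanced and original validation F1 scores is the
    # bias you can tolerate when you later look at F1 on C.
    f1_Bp = f1_score(y_Bp, model.predict(X_Bp))   # F1 on B'
    f1_B  = f1_score(y_B,  model.predict(X_B))    # F1 on B
    acceptable_bias = f1_Bp - f1_B
    print(f"F1(B') = {f1_Bp:.3f}, F1(B) = {f1_B:.3f}, gap = {acceptable_bias:.3f}")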

Now, the new interpretation of your learning curves is:

  • Overfit - Lines don't come together and are farther apart than F1(B')-F1(B);
  • Underfit - Lines don't come together, but the difference is less than F1(B')-F1(B) and the F1(C) score is under F1(B);
  • Just right - Lines don't come together, but the difference is less than F1(B')-F1(B) with an F1(C) score similar to F1(B).

General: I strenuously suggest that for unbalanced classes you first try adjusting the class weights in your learning algorithm instead of over/under-sampling, as it avoids all of the rigmarole we've outlined above. It's very easy in libraries like scikit-learn and pretty easy to hand-code in anything that uses a sigmoid function or a majority vote.
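
For instance, in scikit-learn this is a single argument, so you could train directly on the unbalanced A (a sketch, assuming the variables from the question):

    from sklearn.linear_model import LogisticRegression

    # 'balanced' reweights each class inversely to its frequency, so no
    # under-sampling is needed; train on the full, unbalanced A.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_A, y_A)

    # Or set the weights explicitly, e.g. roughly matching the ~1300:1
    # imbalance mentioned in the question:
    clf = LogisticRegression(class_weight={0: 1, 1: 1300}, max_iter=1000).fit(X_A, y_A)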

Hope this helps!

Other tips

For 1) and 2), you want to:

1) choose a model that performs well on data distributed as you expect the real data will be;
2) evaluate the model on data distributed the same way.

So for those datasets, you shouldn't need to balance the classes.

You might also try using class weights instead of under/oversampling, as this takes care of this decision for you.

For 3) you likely want to optimize using whatever metric you will be scored on (if it's a competition). But if that isn't a consideration, all of those metrics are fine choices. F1 may be influenced by the low precision, but you want that to be captured. It's precisely when naive models (like guessing the majority class) can score well by some metrics that scores like F1 are relevant.

As for 4) there is nothing wrong with showing whichever metric you end up optimizing on.

You should test your classifier on a dataset that represents the way it will be used. The best choice is usually the unmodified distribution.

During training, modify the dataset in any way that helps you.

For details, see Should I go for a 'balanced' dataset or a 'representative' dataset?

Licensed under: CC-BY-SA with attribution