Question

I am building a binary classification model for imbalanced data (e.g., 90% positive class vs. 10% negative class).

I have already balanced my training dataset to a 50/50 class split, while my holdout (test) dataset was kept at the original data distribution (i.e., 90% vs. 10%). My question is about the validation data used during CV hyperparameter tuning. During each CV iteration, should:

1) Both the training and validation folds be balanced,

or

2) The training fold be kept balanced while the validation fold is left imbalanced to reflect the original data distribution and the holdout dataset?

I am currently using the 1st option to tune my model; however, is this approach valid given that the holdout and validation datasets have different distributions?
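
For reference, this is roughly what my setup looks like, as a minimal sketch with synthetic stand-in data (using scikit-learn and imbalanced-learn; the particular balancing method is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

# Synthetic 90/10 data standing in for the real dataset
X, y = make_classification(n_samples=10000, weights=[0.9, 0.1], random_state=42)

# Holdout keeps the original 90/10 distribution via stratification
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Only the training data is balanced to a 50/50 split
X_train_bal, y_train_bal = RandomUnderSampler(random_state=42).fit_resample(
    X_train, y_train
)
```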


Solution

Both the test and validation datasets should have the same distribution. In that case, the performance metrics on the validation dataset are a good approximation of the performance metrics on the test dataset. The training dataset, however, can have a different distribution, and it is fine, and sometimes helpful, to balance it. On the other hand, balancing the test dataset would lead to a biased estimate of the model's performance, because the test dataset should reflect the original class imbalance. And since, as mentioned at the beginning, the test and validation datasets should have the same distribution, the validation dataset should not be balanced either.
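
A practical way to get this behaviour during CV is to make the balancing step part of the model pipeline, so it is applied only to the training folds while every validation fold keeps the original 90/10 distribution. A minimal sketch, assuming scikit-learn and imbalanced-learn, with a placeholder estimator and parameter grid, and reusing the `X_train`, `y_train` from the question's setup (i.e., the original imbalanced training data):

```python
from imblearn.pipeline import Pipeline              # pipeline that accepts samplers
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# The sampler is a pipeline step, so it is fit and applied only on the
# training folds of each CV split; the validation folds are never resampled
# and keep the original class imbalance.
pipe = Pipeline([
    ("balance", RandomUnderSampler(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {"clf__C": [0.01, 0.1, 1, 10]}         # placeholder grid

search = GridSearchCV(
    pipe,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="average_precision",                     # threshold-free, imbalance-aware
)
search.fit(X_train, y_train)  # note: the *original, imbalanced* training data
```

The important detail is that the search is fed the original imbalanced training data; the pipeline balances each training fold internally, while scoring always happens on untouched validation folds.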

Additionally, I should mention that if you balance the test dataset, you will see better performance numbers than you would on an imbalanced test set. Of course, as explained above, using a balanced test set does not make sense, so the resulting performance is not reliable unless you evaluate on an imbalanced dataset with the same class distribution as the actual data.
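
To see the effect, you can score the same fitted model on the original holdout and on an artificially balanced copy of it (continuing the sketch above; the metric is just an illustration). The minority-class precision will typically look better on the balanced copy simply because the class prevalence changed, not because the model improved:

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import precision_score

model = search.best_estimator_

# Artificially balanced copy of the holdout (for illustration only;
# not a valid way to report performance)
X_hold_bal, y_hold_bal = RandomUnderSampler(random_state=0).fit_resample(
    X_hold, y_hold
)

print("precision on original 90/10 holdout:",
      precision_score(y_hold, model.predict(X_hold)))
print("precision on balanced 50/50 holdout:",
      precision_score(y_hold_bal, model.predict(X_hold_bal)))
```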

OTHER TIPS

In my opinion, the validation set should follow the original imbalanced distribution: the goal is ultimately to apply the model to the real distribution, so the hyperparameters should be chosen to maximize performance on that distribution.

But since I'm not completely sure, I'd suggest trying both options and adopting the one that gives the best performance on the test set.
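
For that comparison, the final check should be on the untouched, imbalanced test set, with metrics that expose minority-class behaviour. A sketch, assuming two already-tuned candidate models (`model_a` and `model_b` are hypothetical names, one per option) and the imbalanced holdout `X_hold`, `y_hold`:

```python
from sklearn.metrics import average_precision_score, classification_report

# Compare both tuning strategies on the same untouched, imbalanced holdout;
# per-class precision/recall and PR-AUC show minority-class performance that
# plain accuracy would hide at a 90/10 ratio.
for name, model in [("option 1", model_a), ("option 2", model_b)]:
    y_pred = model.predict(X_hold)
    y_score = model.predict_proba(X_hold)[:, 1]
    print(name)
    print(classification_report(y_hold, y_pred))
    print("PR-AUC:", average_precision_score(y_hold, y_score))
```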

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange