While downsampling training data, should we also downsample the validation data or retain the validation split as it is?

https://datascience.stackexchange.com/questions/74386

11-12-2020

Question

I am dealing with a class imbalance problem. To handle it, I am downsampling the majority class in the training set.

Among the training, validation and test splits, the majority class in the training split is downsampled and the test split is kept as it is. Should the validation split be downsampled to match the training set, or should it also be kept as it is?

I ask because the validation set controls the training process.

Solution

I would recommend not downsampling the validation set. In the end, you care about performance on the test set, which has the skewed class distribution. Your validation set (used for hyperparameter selection, early stopping, etc.) should therefore have the same distribution as the test set, in my opinion.
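
To make this concrete, here is a minimal sketch that downsamples the majority class in the training split only and leaves the validation and test splits untouched. The toy labels, the 600/200/200 split and the helper name `downsample_majority` are assumptions made for illustration, not something from the question.

```python
import random
from collections import Counter


def downsample_majority(labels, indices, rng):
    """Return a subset of `indices` in which every class has as many
    examples as the rarest class (hypothetical helper for illustration)."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(members) for members in by_class.values())
    kept = []
    for members in by_class.values():
        # Sample without replacement so no example is repeated.
        kept.extend(rng.sample(members, n_min))
    rng.shuffle(kept)
    return kept


rng = random.Random(0)

# Toy imbalanced data set (assumed): 900 negatives, 100 positives.
labels = [0] * 900 + [1] * 100
indices = list(range(len(labels)))
rng.shuffle(indices)

# Split first, then resample the training split only.
train_idx, val_idx, test_idx = indices[:600], indices[600:800], indices[800:]
train_balanced = downsample_majority(labels, train_idx, rng)

train_counts = Counter(labels[i] for i in train_balanced)
val_counts = Counter(labels[i] for i in val_idx)
print("train (downsampled):", dict(train_counts))
print("validation (untouched):", dict(val_counts))
```

The training counts come out balanced, while the validation split keeps roughly the original 9:1 skew, so metrics computed on it should be closer to what you will see on the test set.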

Have you considered upsampling the minority class instead? By downsampling you lose training data, which might contain valuable information, so it might harm the learning process.
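
As a rough sketch of the upsampling alternative, the code below repeats minority-class examples (sampling with replacement) until the classes in the training split are balanced, so no majority-class example is thrown away. The toy labels and the helper name `upsample_minority` are assumptions for illustration. Note that the resampling is applied after splitting, so duplicated examples cannot end up in the validation set.

```python
import random
from collections import Counter


def upsample_minority(labels, indices, rng):
    """Return `indices` plus randomly repeated minority examples so that
    every class is as large as the largest one (hypothetical helper)."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_max = max(len(members) for members in by_class.values())
    out = []
    for members in by_class.values():
        out.extend(members)  # keep every original example
        # Draw the missing examples with replacement from the same class.
        out.extend(rng.choices(members, k=n_max - len(members)))
    rng.shuffle(out)
    return out


rng = random.Random(0)

# Toy training split (assumed): 540 negatives, 60 positives.
train_labels = [0] * 540 + [1] * 60
train_idx = list(range(len(train_labels)))

train_upsampled = upsample_minority(train_labels, train_idx, rng)
upsampled_counts = Counter(train_labels[i] for i in train_upsampled)
print(dict(upsampled_counts))
```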

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange