Question

I have a batch size of 128 and a total data size of around 10 million, and I am classifying between 4 different label values.

How much of a problem is it if each batch only contains data with one label?

So for example - batch 0 all have the 3rd label. Batch 1 all have the 1st. Batch 2 the 2nd. Etc.

Solution

If you are asking about convergence, then most probably it won't converge. I say "probably" because, although we have a good intuition of how convergence works, no one can be completely sure.


Let's look at a few aspects of the learning process.
- Every data point has its own distinct loss surface.
- The overall loss surface is the average of all the individual loss surfaces.
- We can safely assume that the loss surfaces of data points within one class are close to each other, but quite different from those of the other classes.
- Lastly, the gradient is calculated at the end of a batch (see the update rule below).
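To make the last two points concrete, here is the usual mini-batch SGD update (a sketch assuming plain SGD with learning rate $\eta$ and a batch $\mathcal{B}_t$ of size $B$):

$$g_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t} \nabla_\theta\, \ell(x_i, y_i; \theta_t), \qquad \theta_{t+1} = \theta_t - \eta\, g_t$$

With a well-shuffled $\mathcal{B}_t$, $g_t$ is an approximately unbiased estimate of the gradient of the overall loss; with a single-class $\mathcal{B}_t$, it is the gradient of that one class's loss surface only.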


When each batch is well shuffled -
In this case, the gradient at the end of each batch is averaged over all classes, so it points roughly in the direction of the overall gradient. This means a smooth loss reduction with each batch, as the quick check below illustrates.
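Here is a small sketch (with a synthetic label array matching the numbers in the question) showing that shuffling alone already gives each batch a roughly even mix of the four classes:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=10_000_000)  # ~10M points, 4 classes (synthetic)

# One well-shuffled batch of 128: take 128 indices from a random permutation.
batch_labels = labels[rng.permutation(labels.size)[:128]]
print(np.bincount(batch_labels, minlength=4))  # roughly 32 samples of each class
```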


When each batch contains only one class -
In this case, consecutive batches move on different loss surfaces (one per class).
The issue is that we compute the loss on one class's surface, but after the weight update at the end of the batch, the next batch continues on its own surface while starting from weights tuned for the previous class. This makes learning very unpredictable.
Also, much of the logic built into optimizers (momentum, adaptive learning rates, etc.) will not help, because the gradient changes abruptly with each batch.
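To illustrate that last point, here is a small sketch (synthetic data and a plain linear softmax classifier, so only an illustration): the gradient computed from a single-class batch points in a noticeably different direction from the gradient of the overall loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, dim, classes = 4096, 20, 4
centres = torch.randn(classes, dim) * 3       # each class gets its own cluster
y = torch.randint(0, classes, (n,))
X = centres[y] + torch.randn(n, dim)

W = 0.01 * torch.randn(dim, classes)          # weights of a linear classifier
W.requires_grad_(True)

def grad_wrt_w(mask):
    loss = F.cross_entropy(X[mask] @ W, y[mask])
    (g,) = torch.autograd.grad(loss, W)
    return g.flatten()

full_grad = grad_wrt_w(torch.ones(n, dtype=torch.bool))
for c in range(classes):
    g_c = grad_wrt_w(y == c)  # gradient seen by a batch containing only class c
    sim = F.cosine_similarity(g_c, full_grad, dim=0).item()
    print(f"class {c}: cosine similarity to the overall gradient = {sim:.2f}")

# The similarities are well below 1, so consecutive single-class updates keep
# pulling the weights in different directions rather than toward the overall minimum.
```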

I tried this on MNIST (using only the 0 and 1 digits) and got the loss-per-batch curve below. I have not put much effort into learning-rate tuning, LR decay, etc., so I can't conclude that the result will always look like this.

[Figure: loss per batch for this experiment]
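For reference, here is a rough sketch of an experiment along these lines (assuming PyTorch/torchvision, a single linear layer, and plain SGD; not necessarily the exact setup behind the plot):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

BATCH = 128

def zero_one_mnist():
    # Keep only the 0 and 1 digits of MNIST.
    ds = datasets.MNIST("data", train=True, download=True,
                        transform=transforms.ToTensor())
    keep = [i for i, y in enumerate(ds.targets.tolist()) if y in (0, 1)]
    return Subset(ds, keep)

def single_class_order(subset):
    # Build an index order in which consecutive batches alternate between
    # class 0 and class 1, so every batch holds a single label.
    targets = subset.dataset.targets[subset.indices].tolist()
    by_class = {0: [], 1: []}
    for i, t in enumerate(targets):
        by_class[t].append(i)
    order, pos = [], {0: 0, 1: 0}
    while any(pos[c] < len(by_class[c]) for c in (0, 1)):
        for c in (0, 1):
            order += by_class[c][pos[c]:pos[c] + BATCH]
            pos[c] += BATCH
    return order

def loss_per_batch(single_class):
    ds = zero_one_mnist()
    if single_class:
        loader = DataLoader(Subset(ds, single_class_order(ds)),
                            batch_size=BATCH, shuffle=False)
    else:
        loader = DataLoader(ds, batch_size=BATCH, shuffle=True)
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    losses = []
    for x, y in loader:
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

shuffled_losses = loss_per_batch(single_class=False)   # drops smoothly
one_class_losses = loss_per_batch(single_class=True)   # oscillates as the class flips
```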

OTHER TIPS

If every batch has only a single label, the model will not learn very well. Each update will try to minimize the error for that one label only, so the model will never find a set of weights that works for all labels at the same time.

If a single batch happens to randomly contain only one label, it won't affect training much, since the other batches will have all labels represented.

Welcome to StackExchange!

Yes, part of the point of (shuffled) mini-batches is to keep each update reasonably balanced, even when the dataset itself is imbalanced.

You should train on balanced data (i.e., the same prevalence of all classes) and measure performance on a representative dataset, as discussed here and here.
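In PyTorch, one common way to get such balanced mini-batches is a weighted random sampler; below is a sketch on a hypothetical imbalanced 4-class dataset (a generic recipe, not the specific method from the paper quoted below):

```python
from collections import Counter
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical imbalanced 4-class dataset: class sizes 7000 / 2000 / 800 / 200.
sizes = [7000, 2000, 800, 200]
y = torch.cat([torch.full((n,), c, dtype=torch.long) for c, n in enumerate(sizes)])
X = torch.randn(len(y), 16)
dataset = TensorDataset(X, y)

# Sample each point with probability inversely proportional to its class size,
# so every mini-batch is class-balanced in expectation.
counts = Counter(y.tolist())
weights = [1.0 / counts[int(label)] for label in y]
sampler = WeightedRandomSampler(weights, num_samples=len(y), replacement=True)
loader = DataLoader(dataset, batch_size=128, sampler=sampler)

xb, yb = next(iter(loader))
print(torch.bincount(yb, minlength=4))  # roughly 32 samples per class
```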


A neural network trained with imbalanced data often has varied levels of precision in determining each class depending on the difference in the number of class samples in the training data, which is a significant problem in industrial quality inspection. As a solution to this problem, we propose a balanced mini-batch training method that can virtually balance the class ratio of training samples. In an experiment, the neural network trained with the proposed method achieved higher classification ability than that trained with over-sampled or undersampled data for two types of imbalanced image datasets.

[Ryota et al]

Licensed under: CC-BY-SA with attribution