Question

With reference to Why should the data be shuffled for machine learning tasks, I was curious whether this also holds for neural machine translation.

I would like to lay out why I think shuffling is not needed for the three splits (train, test, validation) of the data (as a reference dataset, think of the Europarl dataset):

  • for the training part, when doing mini-batch gradient descent it seems better that each update uses data that is semantically close, so that learning focuses on one direction at a time. For example, it seems better to have many sentences about, say, coffee together in a single update than a mix of sentences about drugs, coffee, dogs, etc. in that update

  • for the test part, we would like to see how the model behaves on a long text of continuous sentences, since the ultimate goal of translation is documents. The same should hold for validation

I would like your view on this, if you have any empirical or theoretical insight into why this is not the case for neural machine translation.


Solution

Yes, in NMT data is always shuffled.

For training, if each batch contains the same "type of content" (e.g. domain, register), then your model will be biased toward the type of content of the last batches. You don't want that. Apart from that, in NMT data is normally bucketed by length to avoid wasting batch space with padding.
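The combination described above, shuffling globally while still grouping sentences of similar length into batches, can be sketched as follows. This is a minimal illustration, not the implementation of any particular toolkit; the function name `bucket_batches` and the `bucket_width` parameter are made up for this example, and `pairs` is assumed to be a list of tokenized (source, target) tuples:

```python
import random

def bucket_batches(pairs, batch_size, bucket_width=10, seed=0):
    """Shuffle sentence pairs globally, then group sentences of
    similar source length into batches to reduce wasted padding.
    `pairs` is a list of (source_tokens, target_tokens) tuples."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)  # global shuffle: breaks up domain/topic runs
    # stable sort by length bucket keeps the shuffled order within a bucket
    pairs.sort(key=lambda p: len(p[0]) // bucket_width)
    batches = [pairs[i:i + batch_size]
               for i in range(0, len(pairs), batch_size)]
    rng.shuffle(batches)  # also randomize the order batches are visited in
    return batches
```

Because the sort is only by coarse length bucket, each batch still mixes content from all over the corpus while containing sentences of roughly similar length.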

For the test data, you would like to have data that follows a similar distribution to the one to be used in production. If the test is biased to only one domain, then you won't be able to assess correctly the expected performance of the model.

Document-level translation has little to do with the shuffling of the data. In document-level MT you normally have context sentences that are either ingested with an auxiliary encoder or concatenated into a single, longer input. This may reduce the number of sentences you can fit in a batch, but each sentence (plus its context) is independent of the rest.
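To make the independence point concrete, here is a sketch of the concatenation approach, assuming context is attached with a separator token. The function name `add_context` and the `<SEP>` marker are illustrative, not from any specific system:

```python
def add_context(sentences, window=1, sep="<SEP>"):
    """Turn a document (list of sentences) into self-contained
    examples of the form "previous-context <SEP> current sentence".
    Each example carries its own context, so the examples can still
    be shuffled freely across documents."""
    examples = []
    for i, sent in enumerate(sentences):
        context = sentences[max(0, i - window):i]
        examples.append(sep.join(context + [sent]))
    return examples
```

Since every example bundles the context it needs, nothing is lost by shuffling them afterwards.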

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange