Question

This is a project on Neural Machine Translation for the English/Irish language pair. I have spent the past month or so trying to train a good baseline to experiment on. I have a corpus of ~850k sentences (unfortunately, Irish data is very limited). When I trained the model and evaluated it with BLEU, I got a score of 65.02, which is obviously implausibly high. These were my fairseq-train settings:

!CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin-full_corp/MayNMT \
  --lr 5e-4 --lr-scheduler inverse_sqrt --optimizer adam \
  --clip-norm 0.1 --dropout 0.2 --max-tokens 4096 \
  --arch transformer --save-dir checkpoints/full-tran

I know not everyone in NLP uses Fairseq, but I hope the arguments are self-explanatory.

I deduplicated the dataset (converted it to a Python set(), which keeps only unique entries), so I don't think the issue is that the dev/valid and test sets contain entries duplicated from training, but I'm not sure what else could cause this. Some suggest overfitting may be a cause, but I believe that would only inflate BLEU if the dev set shared entries with the training set. I've tried to find the problem myself, but there aren't many resources covering NMT, let alone BLEU.
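For reference, a minimal sketch of that kind of deduplication (with made-up example sentences), treating each source/target pair as a single unit so the two sides of the parallel corpus stay aligned:

```python
# Deduplicate a parallel corpus by treating each (src, tgt) pair as one unit,
# so an identical sentence pair cannot survive into both train and test.
src_lines = ["hello", "good morning", "hello"]
tgt_lines = ["dia dhuit", "maidin mhaith", "dia dhuit"]

seen = set()
unique_pairs = []
for pair in zip(src_lines, tgt_lines):
    if pair not in seen:        # keep only the first occurrence of each pair
        seen.add(pair)
        unique_pairs.append(pair)

print(len(unique_pairs))  # 2 — the repeated ("hello", "dia dhuit") pair is dropped
```

Deduplicating on the pair rather than on each side separately matters: the same English sentence can legitimately appear with two different Irish translations.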


Solution

According to recent publications, BLEU scores as high as yours are not impossible for English→Irish. Nevertheless, without further information, yours certainly seems too high.

Nothing in the command-line arguments looks problematic.

The most probable explanation is, as you already pointed out, data leakage between the validation/test and training sets. Note that while you removed exact duplicates, there may still be partial matches that go unnoticed, i.e. test sentences that are near-copies of training sentences. You may want to look into similarity metrics to detect such near-duplicates; the most straightforward is the Jaccard similarity.
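A minimal sketch of near-duplicate detection with Jaccard similarity over token sets (whitespace tokenisation and the 0.8 threshold are assumptions; the example sentences are hypothetical):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two sentences."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Flag test sentences that are suspiciously similar to a training sentence.
THRESHOLD = 0.8  # arbitrary cut-off; tune on your own data
train = ["tá an madra sa ghairdín", "maidin mhaith duit"]
test = ["tá an madra mór sa ghairdín"]  # near-copy: one extra word

for t in test:
    for tr in train:
        if jaccard(t, tr) >= THRESHOLD:
            print(f"possible leak: {t!r} ~ {tr!r}")
```

This is O(n·m) over the two sets, which is fine for ~850k sentences against a small test set; for larger-scale checks, hashing-based approaches such as MinHash are the usual next step.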

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange