Question

Machine translation models are usually evaluated using the BLEU score. I want to get some intuition for this score. What is the BLEU score of a professional human translator?

I know it depends on the languages, the translator, etc. I just want to get a sense of the scale.

edit: to make it clear, I am asking about the expected BLEU score. This is not a theoretical question; it is an experimental one.


Solution

The original paper "BLEU: a Method for Automatic Evaluation of Machine Translation" contains a couple of numbers on this:

The BLEU metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation. For this reason, even a human translator will not necessarily score 1. It is important to note that the more reference translations per sentence there are, the higher the score is. Thus, one must be cautious making even “rough” comparisons on evaluations with different numbers of reference translations: on a test corpus of about 500 sentences (40 general news stories), a human translator scored 0.3468 against four references and scored 0.2571 against two references.

But as their Table 1 shows (it lists the scores against two references; H2 is the translator mentioned in the quote above), there is variance among human BLEU scores:

[Table 1 from the paper: BLEU scores against two reference translations, including those of the human translators]

Unfortunately, the paper does not specify the skill level of the translators.
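
The multi-reference effect described in the quote is easy to reproduce yourself. Below is a minimal sketch using NLTK's sentence_bleu (my choice of tool, not the paper's scoring script); the sentences are invented toy examples, so only the pattern matters: scoring the same hypothesis against more references tends to give a higher BLEU.

```python
# Rough sketch with NLTK's BLEU implementation (scores range from 0 to 1, as in the paper).
# The sentences are invented toy examples; only the pattern matters, not the exact numbers.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoids zero scores on short toy sentences

hypothesis = "the cat sat quietly on the old mat".split()

two_refs = [
    "the cat sat on the mat".split(),
    "a cat was sitting on the mat".split(),
]
four_refs = two_refs + [
    "the cat sat quietly on the mat".split(),
    "on the old mat the cat sat".split(),
]

print(sentence_bleu(two_refs, hypothesis, smoothing_function=smooth))   # lower
print(sentence_bleu(four_refs, hypothesis, smoothing_function=smooth))  # higher: each n-gram in the
                                                                        # hypothesis has more chances to match
```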

OTHER TIPS

BLEU scores are based on comparing the translation being evaluated against a gold-standard translation. In general, the gold-standard translation is the same source sentence translated by a professional translator, so in theory a professional human translation should receive the maximum score of 1 (BLEU scores are normalized between 0 and 1).

However, it's important to keep in mind that:

  • Even professional translators don't always agree on what the "correct" translation is, so there is no perfect evaluation method.
  • There can be multiple valid translations of the same sentence. BLEU can take this into account by scoring against several references, but most of the time BLEU scores are calculated with a single reference translation. As a consequence, a perfectly good translation can get a low score.
  • BLEU scores are based on counting the n-grams that the predicted translation has in common with the gold standard. It is a reasonably good proxy for translation quality, but it is also very basic: in particular, it cannot take meaning into account (see the short sketch after this list).
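
As a concrete illustration of the last two points, here is a minimal sketch (again with NLTK and invented sentences, my own example rather than anything from the answer): a paraphrase that a human would accept as a correct translation scores poorly against a single reference, simply because it shares few surface n-grams with it.

```python
# A perfectly valid paraphrase vs. a verbatim copy, both scored against one reference.
# Toy sentences; BLEU only counts n-gram overlap, so the paraphrase scores low.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

reference  = "the meeting was postponed until next week".split()
paraphrase = "they delayed the meeting by one week".split()   # same meaning, different wording
verbatim   = "the meeting was postponed until next week".split()

print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # low, despite identical meaning
print(sentence_bleu([reference], verbatim,   smoothing_function=smooth))  # 1.0
```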
Licensed under: CC-BY-SA with attribution