I'm starting to look into doing some machine translation of search queries, and have been trying to think of ways to evaluate my translation system between iterations and against other systems. The first thing that comes to mind is getting translations of a set of search terms from a bunch of people on Mechanical Turk and having them judge whether each is valid, or something along those lines, but that would be expensive and possibly prone to people submitting bad translations.

Now that I'm trying to think of something cheaper or better, I figured I'd ask StackOverflow for ideas, in case there's already some standard available, or someone has tried to find one of these before. Does anyone know, for example, how Google Translate rates various iterations of their system?


Solution

I'd suggest refining your question. There are a great many metrics for machine translation, and it depends on what you're trying to do. In your case, I believe the problem is simply stated as: "Given a set of queries in language L1, how can I measure the quality of the translations into L2, in a web search context?"

This is basically cross-language information retrieval.

What's important to realize here is that you don't actually care about providing the user with the translation of the query: you want to get them the results that they would have gotten from a good translation of the query.

To that end, you can simply measure the discrepancy of the results lists between a gold translation and the result of your system. There are many metrics for rank correlation, set overlap, etc., that you can use. The point is that you need not judge each and every translation, but just evaluate whether the automatic translation gives you the same results as a human translation.
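For instance, here is a minimal sketch of the set-overlap idea in Python. It assumes a hypothetical `search(query)` function that returns a ranked list of document IDs from your index, and uses Jaccard overlap of the top-k results, which is just one of the many possible metrics:

```python
def overlap_at_k(gold_results, system_results, k=10):
    """Jaccard overlap between the top-k results retrieved for the gold
    translation and for the system translation of the same query."""
    gold_top, sys_top = set(gold_results[:k]), set(system_results[:k])
    if not gold_top and not sys_top:
        return 1.0
    return len(gold_top & sys_top) / len(gold_top | sys_top)

# e.g. overlap_at_k(search(human_translation), search(machine_translation))
```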

As for people proposing bad translations, you can assess whether the putative gold-standard candidates have similar result lists (i.e., given 3 manual translations, do they agree in their results? If not, use the 2 that overlap the most). If so, then these are effectively synonyms from the IR perspective.
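As a rough sketch of that screening step, reusing the hypothetical `overlap_at_k` helper above (the pairwise comparison over candidates is my own assumption about how you would operationalize it):

```python
from itertools import combinations

def most_agreeing_candidates(candidate_results, k=10):
    """Given a dict mapping each manual translation to its ranked result list,
    return the pair of translations whose top-k results overlap the most."""
    best_pair, best_score = None, -1.0
    for a, b in combinations(candidate_results, 2):
        score = overlap_at_k(candidate_results[a], candidate_results[b], k)
        if score > best_score:
            best_pair, best_score = (a, b), score
    return best_pair, best_score
```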

Other tips

There is some information below that might be useful, as it provides a basic explanation of the BLEU scoring technique that developers often use to measure the quality of an MT system.

The first link provides a basic overview of BLEU, and the second points out some of BLEU's limitations.

http://kv-emptypages.blogspot.com/2010/03/need-for-automated-quality-measurement.html

and

http://kv-emptypages.blogspot.com/2010/03/problems-with-bleu-and-new-translation.html
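If you do want a text-level metric rather than a results-level one, here is a minimal sketch of computing BLEU with NLTK. The tokenized example queries are made up, and note that BLEU is known to behave poorly on very short segments such as search queries (part of what the second link discusses), hence the smoothing:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference translations (tokenized) per query,
# and one candidate (machine) translation per query.
references = [[["cheap", "flights", "to", "paris"]]]
candidates = [["cheap", "flight", "to", "paris"]]

score = corpus_bleu(references, candidates,
                    smoothing_function=SmoothingFunction().method1)
print(f"corpus BLEU: {score:.3f}")
```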

There is also some very specific, pragmatic advice on how to develop a useful test set on the AsiaOnline.Net site, in the November newsletter. I am unable to put that link in as there is a limit of two.

In our MT evaluation we use the hLEPOR score (see the slides for details).
