Question

I have a text corpus which is already aligned at sentence level by construction - it is a list of pairs of English strings and their translation in another language. I have about 10 000 strings of 5 - 20 words each and their translations. My goal is to try to build a metric of the quality of the translation - automatically of course, because I'm dealing with languages I know nothing about :)

I'd like to build a dictionary from this list of translations that would give me the (most probable) translation of each word in the source English strings into the other language. I know the dictionary will be far from perfect, but I'm hoping I can get something good enough to flag when a word is not consistently translated. For example, if my dictionary says "Store" is to be translated into French as "Magasin", then if I spot some place where "Store" is translated as "Boutique", I can suspect that something is wrong.
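To make this concrete, the kind of check I have in mind would look roughly like the sketch below (Python; the dictionary entry and the sentence pair are made up purely for illustration):

    # Hypothetical example: flag sentence pairs where the expected translation
    # of a known word never appears in the target sentence.
    dictionary = {"store": "magasin"}  # made-up entry; would really come from the corpus

    def flag_inconsistent(pairs, dictionary):
        """pairs is a list of (english_sentence, foreign_sentence) tuples."""
        flagged = []
        for en, fr in pairs:
            fr_words = set(fr.lower().split())
            for word in en.lower().split():
                expected = dictionary.get(word)
                if expected is not None and expected not in fr_words:
                    flagged.append((en, fr, word, expected))
        return flagged

    pairs = [("I went to the store", "Je suis allé à la boutique")]
    print(flag_inconsistent(pairs, dictionary))
    # [('I went to the store', 'Je suis allé à la boutique', 'store', 'magasin')]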

So I'd need to:

  1. build a dictionary from my corpus
  2. align the words inside the string/translation pairs

Do you have good references on how to do this? Known algorithms? I found many links about text alignment but they seem to be more at the sentence level than at the word level...

Any other suggestion on how to automatically check whether a translation is consistent would be greatly appreciated!

Thanks in advance.


Solution

A freely available (specifically, GPL-licensed) tool for word alignment is GIZA++. It trains the well-known IBM models mentioned in the other answers, as well as other statistical models.

You can download it from the GIZA++ site at Google Code, and there is a brief introduction to its usage on the Apertium wiki. It boils down to this procedure:

  1. Create your parallel corpus, sentence-aligned (you seem to have this already)
  2. Apply the plain2snt tool included in GIZA++ to extract word lists and sentence lists in GIZA++ format
  3. (Optional – only used for some models:) Generate word classes using the mkcls tool (also included)
  4. Run the actual word alignment tool GIZA++. There are various optional configuration settings you can use to determine the type of model generated.

Before you can do this, you must build the tool from source code by running make. The code is written in C++ and compiles well with recent GCC versions.
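Once GIZA++ has finished, you can turn its alignment output into the word dictionary the question asks for. Below is a rough Python sketch; it assumes the usual *.A3.final output layout (three lines per sentence pair: a header line starting with '#', the target sentence, and the source sentence followed by '({ ... })' sets of 1-based target positions). Check this against the files your GIZA++ run actually produces, and note that which side counts as "source" depends on how you invoked the tool.

    import re
    from collections import Counter, defaultdict

    def parse_a3(path):
        """Yield (source_tokens, target_tokens, alignment) per sentence pair."""
        with open(path, encoding="utf-8") as f:
            lines = [line.rstrip("\n") for line in f]
        for i in range(0, len(lines) - 2, 3):
            target = lines[i + 1].split()
            aligned = re.findall(r"(\S+) \(\{([\d ]*)\}\)", lines[i + 2])
            source, alignment = [], []
            for word, positions in aligned:
                if word == "NULL":        # unaligned target words hang off NULL
                    continue
                source.append(word)
                alignment.append([int(p) - 1 for p in positions.split()])
            yield source, target, alignment

    def build_counts(path):
        """Count how often each source word is aligned to each target word."""
        counts = defaultdict(Counter)
        for source, target, alignment in parse_a3(path):
            for s_word, positions in zip(source, alignment):
                for p in positions:
                    counts[s_word][target[p]] += 1
        return counts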

A few final notes:

  • There is often more than one possible translation for a given word; you shouldn't assume that a specific translation found in one text is necessarily wrong just because the same word is translated differently in another text;

  • One word may be translated into a (not necessarily contiguous) sequence of several words, and vice versa. Some words are not translated at all;

  • GIZA++ is a statistical tool that approximates the correct word alignment; many of the alignments it generates are questionable or incorrect.
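Given alignment counts like the ones produced by the sketch above, one way to respect these caveats is to keep a dictionary entry only when a single translation clearly dominates; the thresholds below are arbitrary and should be tuned on your corpus:

    def build_dictionary(counts, min_total=5, min_share=0.7):
        """Keep a source word only if one target word dominates its alignments.

        Words with several legitimate translations simply stay out of the
        dictionary, so they won't be flagged as inconsistent later.
        """
        dictionary = {}
        for s_word, target_counts in counts.items():
            total = sum(target_counts.values())
            best_word, best_count = target_counts.most_common(1)[0]
            if total >= min_total and best_count / total >= min_share:
                dictionary[s_word] = best_word
        return dictionary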

Other tips

This is a pretty standard statistical machine translation problem called 'word alignment'.
There are a bunch of EM clustering-based models developed by researchers at IBM, which I think form the basis of most of the cooler models being developed today.
Google for 'ibm word alignment models' to find out about IBM Models 1 to 5.
This presentation - http://www.stanford.edu/class/cs224n/handouts/cs224n-lecture-05-2011-MT.pdf - seems like a good place to start.
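If you want to see what those models boil down to, IBM Model 1 fits in a few dozen lines. Here is a minimal, unoptimised Python sketch of its EM training loop (uniform initialisation, a fixed number of iterations, no NULL word or smoothing), just to illustrate the idea; GIZA++ does the same job far better:

    from collections import defaultdict

    def train_ibm1(pairs, iterations=10):
        """Estimate t(f | e) from (english_tokens, foreign_tokens) pairs by EM."""
        f_vocab = {f for _, fr in pairs for f in fr}
        t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialisation

        for _ in range(iterations):
            count = defaultdict(float)   # expected counts c(f, e)
            total = defaultdict(float)   # expected counts c(e)
            # E-step: distribute each foreign word's "mass" over the English words
            for en, fr in pairs:
                for f in fr:
                    z = sum(t[(f, e)] for e in en)
                    for e in en:
                        delta = t[(f, e)] / z
                        count[(f, e)] += delta
                        total[e] += delta
            # M-step: re-estimate the translation probabilities
            for (f, e), c in count.items():
                t[(f, e)] = c / total[e]
        return t

Taking, for each English word, the foreign word with the highest t(f | e) gives a first-cut dictionary, though GIZA++'s higher models will produce noticeably better alignments.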

Are you using spaces between words? Whatever separator character you are using, you might check out the slice command in Linux. It gives you the ability to pull out the words between spaces and other delimiter characters.
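If the target language does separate words with spaces, plain Python splitting is probably all you need before feeding the corpus to an aligner; a minimal example (Python 3's \w is Unicode-aware, so accented characters are kept):

    import re

    def tokenize(sentence):
        # lower-case and keep runs of word characters, dropping punctuation
        return re.findall(r"\w+", sentence.lower())

    print(tokenize("Je suis allé à la boutique !"))
    # ['je', 'suis', 'allé', 'à', 'la', 'boutique']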

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow