How to design a high performance Key Matching algorithm for a Translation Memory/Cache?

https://stackoverflow.com/questions/21929796

14-10-2022
Question

Recently I've been assigned to build a translation memory (TM) for a new project. The idea is that the TM is a cache layer on top of the RPC layer, which calls the Google Translate API when there is no match in the TM. I am considering using the source text as the key in the TM, so I need a fuzzy matching algorithm to match a query text against the keys in the TM. If the match score is higher than some threshold, such as 0.85 (on a scale of 0 to 1), the cached translation is used instead of calling the Google service.

I've read a lot of articles, blogs and papers, but I still don't know where to start. TF-IDF plus cosine similarity doesn't seem good enough. Levenshtein distance? What about semantic similarity, and how would I compute it?

I read about this. In the comments, @mbatchkarov seems to point in the right direction.

Does anyone have similar experience with this subject? Any suggestions are welcome.

Solution

The accepted answer to the question you linked to can often get you quite far. You can compare the word (lemma) overlap between a query and all queries in the cache. To improve matching quality, you can incorporate word similarity to help you link semantically similar words. The thesaurus-building software I linked to in my comment is BSD-licensed, so you are free to use it as you see fit. If you need any help using it, the developers (disclaimer: I am part of the team) will be happy to help out. In fact, I've got a few pre-built thesauri lying around. These should probably be part of the software, but they are too large to upload to GitHub.
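As a rough illustration of the word-overlap idea, here is a minimal Python sketch. It is not taken from the linked answer or from the thesaurus software: the names `tokens`, `overlap` and `lookup` are made up for this example, lowercased word splitting stands in for real lemmatization, and Jaccard overlap is just one possible way to score the overlap. The 0.85 default is the threshold mentioned in the question.

```python
import re


def tokens(text):
    """Lowercased word set; a crude stand-in for real lemmatization."""
    return set(re.findall(r"\w+", text.lower()))


def overlap(a, b):
    """Jaccard overlap between the word sets of two strings (0 to 1)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def lookup(query, cache, threshold=0.85):
    """Return the cached translation of the best-matching key, or None.

    `cache` maps source text to translated text. None means "no match
    above the threshold", i.e. the caller should call the translation
    service and store the result.
    """
    best_key, best_score = None, 0.0
    for key in cache:
        score = overlap(query, key)
        if score > best_score:
            best_key, best_score = key, score
    if best_key is not None and best_score >= threshold:
        return cache[best_key]
    return None
```

This scans every key on each lookup, which is fine for a small cache. For a large one you would probably want an inverted index from words to keys, so that only keys sharing at least one word with the query are scored.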

Whichever approach you go for, be aware that there will be many cases where this does not work well. This is because the approaches discussed in that question are about semantic similarity, and your application may require semantic equivalence. For example, "I like big ginger cats" and "We like big ginger cats" or "We like small ginger cats" are very similar in meaning, but it would be wrong to use the translation of one as a translation of the other.
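To make the risk concrete, the short Python sketch below scores the example sentences with the standard library's `difflib.SequenceMatcher`, a character-based similarity used here only for illustration (it is not one of the methods from the linked question). The first pair differs only in its subject, yet it scores above the 0.85 threshold from the question, so a cache keyed this way would reuse the wrong translation.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()


pairs = [
    ("I like big ginger cats", "We like big ginger cats"),
    ("We like big ginger cats", "We like small ginger cats"),
]

for a, b in pairs:
    # High scores here mean "similar text", not "same translation".
    print(f"{char_similarity(a, b):.3f}  {a!r} vs {b!r}")
```

One possible mitigation, offered as a suggestion rather than a tested recipe, is to treat a fuzzy hit as a candidate only and require an exact match on the words that differ most in meaning, such as pronouns, numbers and negations.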

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow