Romanization of Unicode text

https://stackoverflow.com/questions/9842527

26-05-2021
|

Question

I am looking for a way to transliterate Unicode letter characters from any language into accented Latin letters. The intent is to allow foreigners to gain insight into the pronunciation of names and words written in any non-Latin script.

Examples:

Greek:Romanize("Αλφαβητικός") returns "Alphabētikós" (or "Alfavi̱tikós")

Japanese:Romanize("しんばし") returns "shimbashi" (or "sinbasi")

Russian:Romanize("яйца Фаберже") returns "yaytsa Faberzhe" (or "jajca Faberže")

It should ideally support characters in the following scripts: CJK, Indic, Cyrillic, Semitic, and Greek. It should to be data driven and extensible, using data from either the Unicode Consortium, the USA, the EU or the UN. The code should be open source written in .NET or Java.

Does such a library exist?

Solution

You can use Unidecode Sharp :

[a C#] port from Python Unidecode that itself port from Perl unidecode. (there are also PHP and Ruby implementations available)

Usage;

using BinaryAnalysis.UnidecodeSharp;

.......................................

string _Greek="Αλφαβητικός";
MessageBox.Show(_Greek.Unidecode());

string _Japan ="しんばし";
MessageBox.Show(_Japan.Unidecode());

string _Russian ="яйца Фаберже";
MessageBox.Show(_Russian.Unidecode());

I hope, it will be good for you.

OTHER TIPS

The problem is a lot more complex than you think.

Greek, Cyrillic, Indic scripts, Georgian -> trivial, you could program that in an hour
Thai, Japanese Kana -> doable with a bit more effort
Japanese Kanji, Chinese -> these are not alphabets/syllaberies, so you're not in fact transliterating, you're looking up the pronunciation of each symbol in a hopefully large dictionary (EDICT and CCDICT should work), and a lot of times you'll get it wrong unless you're also considering the context, especially in Japanese
Korean -> technically an alphabet, but computers can only handle the composed characters, so you need another large database, I'm not aware of any
Arabic, Hebrew -> these languages don't write down short vowels, so a lot of times your transliteration will be something unreadable like "bytlhm" (Bethlehem). I'm not aware of any large databases that map Arabic or Hebrew words to their pronunciation.

I am unaware of any open source solution here beyond ICU. If ICU works for you, great. If not, note that I am the CTO of a company that sells a commercial produce for this purpose that can deal with the icky cases like Chinese words, Japanese multiple reading, and Arabic incomplete orthography.

The Unicode Common Locale Data Repository has some transliteration mappings you could use.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow