Convert non english characters into Unicode (UTF-8)

https://stackoverflow.com/questions/9045932

20-04-2021
|

سؤال

I copied large amount of text from another system to my PC. When I viewed the text in my PC, it looked weird. So I copied all the fonts from the other PC and installed them in mine too. Now the text looks okay, but actually it seems that is not in Unicode. For example, if I copy the text and paste in another UTF-8 supported editor such as Notepad++, I get English characters ("bgah;") only like shown below. enter image description here

How to convert this whole text into unicode text, like the one below. So I can copy the text and paste anywhere else.

பெயர்

The above text was manually obtained using http://www.google.com/transliterate/indic/Tamil

I need this conversion to be done, so I can copy them into database tables.

المحلول

"bgah" looks like a Baamini based system, which is pre-unicode. It was popular in Canada (and the SL Tamil diaspora in general) in the 90s.

As the others mentioned, it looks like a custom visual encoding that mimics the performance of a foreign script while maintaining ASCII encoding.

Google "Baamini to unicode convertor". The University of Colombo seems to have put one up: http://www.ucsc.cmb.ac.lk/ltrl/services/feconverter/?maps=t_b-u.xml

Let me know if this works. If not, I can ask around and get something for you.

نصائح أخرى

'Ja-01' is a font with a custom 'visual encoding'.

That is to say, the sequence of characters really is "bgah;" and it only looks like Tamil to you because the font's shapes for the Latin characters bg look like பெ.

This is always to be avoided, because by storing the content as "bgah;" you lose the ability to search and process it as real Tamil, but this approach was common in the pre-Unicode days especially for less-widespread scripts without mature encoding standards. This application probably predates widespread use of TSCII.

Because it is a custom encoding not shared by any other font, it is very unlikely you will be able to find a tool to convert content in this encoding to proper Unicode characters. It does not appear to be any standard character ordering, so you will have to look at the font (eg in charmap.exe) and note down every character, find the matching character in Unicode and map between them.

For example here's a trivial Python script to replace characters in a file:

mapping= {
    u'a': u'\u0BAF',   # Tamil letter Ya
    u'b': u'\u0BAA',   # Tamil letter Pa
    u'g': u'\u0BC6',   # Tamil vowel sign E (combining)
    u'h': u'\u0BB0',   # Tamil letter Ra
    u';': u'\u0BCD',   # Tamil sign virama (combining)
    # fill in the rest of the mapping information here!
}

with open('ja01data.txt', 'rb') as fp:
    data= fp.read().decode('utf-8')
for char in mapping:
    data= data.replace(char, mapping[char])
with open('utf8data.txt', 'wb') as fp:
    fp.write(data.encode('utf-8'))

The font you found is getting you into trouble. The actual cell text is "bgah;", it gets rendered to பெயர் because you found a font that can work with 8-bit non-Unicode characters. So reading it or pasting it into Notepad++ is going to produce "bgah;" since that's the real text. It can only ever be rendered properly again by forcing the program that displays the string to use that same font.

Ditch the font and enter Unicode so it looks like this:

enter image description here

You could first check whether the encoding is TSCII, as this sounds most probable. It is an 8-bit encoding, and the fonts you copied are probably based on that encoding. Check out whether the TSCII to UTF-8 converter at SourceForge is suitable. The project there is called “Any Tamil Encoding to Unicode” but they say that only TSCII is supported for now.

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى StackOverflow