Converting a Unicode string to Unicode characters in C# for Indian languages

https://stackoverflow.com/questions/13966487

11-12-2021

Question

I need to split a Unicode string into its individual Unicode characters.

For example, in Tamil:

"கமலி" => 'க', 'ம', 'லி'

I am able to extract the Unicode bytes, but turning them back into the characters I expect is the problem.

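// Requires: using System.Diagnostics; using System.Text;
byte[] stringBytes = Encoding.Unicode.GetBytes("கமலி");
char[] stringChars = Encoding.Unicode.GetChars(stringBytes);
// Each char is one UTF-16 code unit, so the vowel sign is written separately.
foreach (var crt in stringChars)
{
    Trace.WriteLine(crt);
}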

This gives the following result:

'க'=>0x0b95

'ம'=>0x0bae

'ல'=>0x0bb2

'ி'=>0x0bbf

So the problem is how to extract 'லி' as the single character 'லி', without splitting it into 'ல' and 'ி'.

In Indian languages it is natural to treat a consonant together with its vowel sign as a single character, but parsing such text in C# makes this difficult.

All I need is for the string to be split into 3 characters.

Solution

To iterate over graphemes you can use the methods of the StringInfo class.

Each combination of base character + combining characters is called a 'text element' by the .NET documentation, and you can iterate over them using a TextElementEnumerator:

var str = "கமலி";
var enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(str);
while (enumerator.MoveNext())
{
    Console.WriteLine(enumerator.Current);
}

Output:

க
ம
லி
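
If the text elements are needed as a collection rather than printed one by one, the same enumerator can fill a list. The following is a minimal sketch, not part of the original answer: the class name TextElementDemo and the helper SplitTextElements are hypothetical names chosen for illustration. It also prints the UTF-16 code units inside each element, which shows that 'லி' is still stored as U+0BB2 followed by U+0BBF and is only grouped for iteration. Note that the exact grouping rules depend on the runtime (newer .NET versions follow Unicode extended grapheme clusters), so results for more complex clusters may differ between versions.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class TextElementDemo
{
    // Hypothetical helper: returns each text element of the input as its own string.
    static List<string> SplitTextElements(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            // GetTextElement() returns the current element typed as string.
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    static void Main()
    {
        const string str = "கமலி";
        List<string> parts = SplitTextElements(str);

        // Number of text elements, as opposed to str.Length (UTF-16 code units).
        Console.WriteLine(parts.Count);   // 3
        Console.WriteLine(str.Length);    // 4

        // StringInfo reports the same count without building a list.
        Console.WriteLine(new StringInfo(str).LengthInTextElements);   // 3

        // Show the code units that make up each text element.
        foreach (string part in parts)
        {
            string codeUnits = string.Join(" ", part.Select(c => $"U+{(int)c:X4}"));
            Console.WriteLine($"{part} => {codeUnits}");
        }
    }
}
```

On a console that cannot render Tamil, the elements may appear as placeholder characters; the counts and code units are unaffected.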

Licensed under: CC-BY-SA with attribution