Its not been bug with Unicode character or TextElementEnumerator Class, As specific to the lanaguage (Tamil)
letter made by any Tamil consonants followed by visual glyph
for eg- க -\u0b95 ெ -\u0bc6 ள -\u0bb3
form Tamil character 'கெள' while its seems similar to formation of visual glyph
க -\u0b95 ௌ-\u0bcc
and its right form to solution. hence before enumerating Tamil character we have replace irregular formation of character.
As with rule of Tamil Grammar (ஔகாரக் குறுக்கம்) the visual glyph (ௌ) will come as starting letter of a word.
so that. the above code is to be should processed as
internal class Program
{
private static void Main()
{
const string unicodetxt1 = "ஊரவர் கெளவை";
List<string> output = Syllabify(unicodetxt1);
Console.WriteLine(output.Count);
const string unicodetxt2 = "கௌவை";
output = Syllabify(unicodetxt2);
Console.WriteLine(output.Count);
}
public static string CheckVisualGlyphPattern(string txt)
{
string[] data = txt.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
string list = string.Empty;
var rx = new Regex("^(.*?){1}(\u0bc6){1}(\u0bb3){1}");
foreach (string s in data)
{
var matches = new List<Match>();
string outputs = rx.Replace(s, match =>
{
matches.Add(match);
return string.Format("{0}\u0bcc", match.Groups[1].Value);
});
list += string.Format("{0} ", outputs);
}
return list.Trim();
}
public static List<string> Syllabify(string unicodetext)
{
var processdata = CheckVisualGlyphPattern(unicodetext);
if (string.IsNullOrEmpty(processdata)) return null;
TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(processdata);
var data = new List<string>();
while (enumerator.MoveNext())
data.Add(enumerator.Current.ToString());
return data;
}
}
It produce the appropriate visual glyph while enumerating.