Question

Why is TextElementEnumerator not parsing Tamil Unicode characters properly?

using System;
using System.Collections.Generic;
using System.Globalization;

namespace Glyphtest
{
    internal class Program
    {
        private static void Main()
        {
            const string unicodetxt1 = "ஊரவர் கெளவை";
            List<string> output = Syllabify(unicodetxt1);
            Console.WriteLine(output.Count);
            const string unicodetxt2 = "கௌவை";
            output = Syllabify(unicodetxt2);
            Console.WriteLine(output.Count);
        }

        public static List<string> Syllabify(string unicodetext)
        {
            if (string.IsNullOrEmpty(unicodetext)) return null;
            TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(unicodetext);
            var data = new List<string>();
            while (enumerator.MoveNext())
                data.Add(enumerator.Current.ToString());
            return data;
        }
    }
}

The code sample above deals with the following Unicode character sequences:

'கௌ' -> 0x0b95 (க) + 0x0bcc (ௌ) (correct form)

'கெள' -> 0x0b95 (க) + 0x0bc6 (ெ) + 0x0bb3 (ள) (incorrect form)

Is this a bug in the TextElementEnumerator class? Why does it not enumerate the string properly?

i.e. கெளவை => 'கெள' + 'வை' is how it should be enumerated (correct form),

whereas கெளவை => 'கெ' + 'ள' + 'வை' is how it is actually enumerated (incorrect form).

If it is a bug, how can I work around it?

Solution

This is not a bug in Unicode or in the TextElementEnumerator class. It is specific to the language (Tamil).

A Tamil letter is formed by a consonant followed by a vowel sign (the visual glyph).

For example, க -\u0b95 ெ -\u0bc6 ள -\u0bb3

form the Tamil characters 'கெள', which look the same as the visual glyph formed by

க -\u0b95 ௌ-\u0bcc

and the latter is the right form. Hence, before enumerating Tamil text, we have to replace the irregular character sequence with the regular one.
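Because the two forms look alike on screen, it can help to print the code points of a string to see which form it actually contains. The following is only a minimal sketch; the `CodePointDump` class and its `Describe` method are names made up for this illustration.

```csharp
using System;
using System.Linq;

internal static class CodePointDump
{
    // Returns the UTF-16 code units of a string as "U+XXXX" tokens.
    // All Tamil characters are in the BMP, so one code unit is one code point here.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    private static void Main()
    {
        Console.WriteLine(Describe("\u0b95\u0bc6\u0bb3")); // U+0B95 U+0BC6 U+0BB3 (irregular form)
        Console.WriteLine(Describe("\u0b95\u0bcc"));       // U+0B95 U+0BCC (regular form)
    }
}
```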

According to a rule of Tamil grammar (ஔகாரக் குறுக்கம்), the visual glyph (ௌ) occurs on the starting letter of a word.

So the code from the question should be changed as follows:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

internal class Program
{
    // Word-initial Tamil consonant followed by U+0BC6 (ெ) and U+0BB3 (ள).
    // The lookahead skips cases where ள carries its own vowel sign or pulli
    // (e.g. வெள்ளை), because there ள is a real letter, not a misspelled ௌ.
    private static readonly Regex VisualAu =
        new Regex("^([\u0b95-\u0bb9])\u0bc6\u0bb3(?![\u0bbe-\u0bcd])");

    private static void Main()
    {
        const string unicodetxt1 = "ஊரவர் கெளவை";
        List<string> output = Syllabify(unicodetxt1);
        Console.WriteLine(output.Count);
        const string unicodetxt2 = "கௌவை";
        output = Syllabify(unicodetxt2);
        Console.WriteLine(output.Count);
    }

    // Replaces the irregular sequence (consonant + ெ + ள) at the start of each
    // word with the regular one (consonant + ௌ). Words are re-joined with single spaces.
    public static string CheckVisualGlyphPattern(string txt)
    {
        if (string.IsNullOrEmpty(txt)) return txt;
        string[] words = txt.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length; i++)
            words[i] = VisualAu.Replace(words[i], "${1}\u0bcc");
        return string.Join(" ", words);
    }

    public static List<string> Syllabify(string unicodetext)
    {
        // Normalize the visual glyph pattern before enumerating text elements.
        var processdata = CheckVisualGlyphPattern(unicodetext);
        if (string.IsNullOrEmpty(processdata)) return null;
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(processdata);
        var data = new List<string>();
        while (enumerator.MoveNext())
            data.Add(enumerator.Current.ToString());
        return data;
    }
}

With this change, the enumerator returns the appropriate visual glyph ('கௌ') as a single text element.

OTHER TIPS

U+0BB3 ᴛᴀᴍɪʟ ʟᴇᴛᴛᴇʀ ʟʟᴀ has Grapheme_Cluster_Break=XX (Other). This makes the grapheme clusters <U+0B95 U+0BC6><U+0BB3> the correct ones, since there is always a grapheme cluster break before characters with Grapheme_Cluster_Break equal to Other.

<U+0B95 U+0BCC> has no internal grapheme cluster breaks because U+0BCC has Grapheme_Cluster_Break=SpacingMark, and there are usually no breaks before such characters (the exceptions are at the start of text or when preceded by a control character).
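The difference between the characters can be checked from C# with `CharUnicodeInfo.GetUnicodeCategory`. Note that this sketch uses the general category as a stand-in for the Grapheme_Cluster_Break property, which is an assumption that happens to hold for these three characters (a letter maps to Other, a spacing combining mark maps to SpacingMark). The `CategoryCheck` class name is made up for this illustration.

```csharp
using System;
using System.Globalization;

internal static class CategoryCheck
{
    private static void Main()
    {
        // U+0BB3 (ள) is a letter, so it starts a new text element.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u0bb3')); // OtherLetter

        // U+0BC6 (ெ) and U+0BCC (ௌ) are spacing combining marks,
        // so they attach to the preceding base character.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u0bc6')); // SpacingCombiningMark
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u0bcc')); // SpacingCombiningMark
    }
}
```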

Well, at least this is what the Unicode standard has to say (http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries).

Now, I have no idea of how Tamil works, so take what follows with a pinch of salt.

U+0BCC decomposes into <U+0BC6 U+0BD7>, meaning the two sequences (<U+0B95 U+0BC6 U+0BB3> and <U+0B95 U+0BCC>) are not canonically equivalent, so there is no requirement for grapheme cluster segmentation to yield the same results.
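The decomposition can be verified with `string.Normalize`. This is a minimal sketch that takes U+0B95 (க) as the base consonant; the `NormalizationCheck` class and `CanonicallyEquivalent` helper are names made up for this illustration.

```csharp
using System;
using System.Text;

internal static class NormalizationCheck
{
    // Two strings are canonically equivalent when their NFD forms are identical.
    public static bool CanonicallyEquivalent(string a, string b)
    {
        return string.Equals(a.Normalize(NormalizationForm.FormD),
                             b.Normalize(NormalizationForm.FormD),
                             StringComparison.Ordinal);
    }

    private static void Main()
    {
        // U+0BCC is equivalent to U+0BC6 U+0BD7 ...
        Console.WriteLine(CanonicallyEquivalent("\u0b95\u0bcc", "\u0b95\u0bc6\u0bd7")); // True
        // ... but not to U+0BC6 U+0BB3.
        Console.WriteLine(CanonicallyEquivalent("\u0b95\u0bcc", "\u0b95\u0bc6\u0bb3")); // False
    }
}
```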

When I look at it with my Tamil-ignorant eyes, it seems the right-hand part of U+0BCC ᴛᴀᴍɪʟ ᴠᴏᴡᴇʟ ꜱɪɢɴ ᴀᴜ and U+0BB3 ᴛᴀᴍɪʟ ʟᴇᴛᴛᴇʀ ʟʟᴀ look exactly the same. However, U+0BCC is a spacing mark, but U+0BB3 isn't. If you use U+0BCC in the input instead of <U+0BC6 U+0BB3>, the result is what you expected.
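A quick way to compare the two inputs is to enumerate both and count the text elements. This is only a sketch; the `ElementCompare` class and `TextElements` helper are names made up for this illustration, and the strings are written with escapes so the two forms cannot be confused visually.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

internal static class ElementCompare
{
    // Collects the text elements reported by TextElementEnumerator.
    public static List<string> TextElements(string text)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
            result.Add(e.GetTextElement());
        return result;
    }

    private static void Main()
    {
        // க + ெ + ள + வ + ை: the letter ள starts its own element.
        Console.WriteLine(TextElements("\u0b95\u0bc6\u0bb3\u0bb5\u0bc8").Count); // 3
        // க + ௌ + வ + ை: the vowel sign ௌ stays attached to க.
        Console.WriteLine(TextElements("\u0b95\u0bcc\u0bb5\u0bc8").Count);       // 2
    }
}
```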

Going out on a limb, I will say that you are using the wrong character but, again, I don't know Tamil at all, so I can't be sure.

Licensed under: CC-BY-SA with attribution