Question

Currently I have code that, based on a CultureInfo cultureInfo = new CultureInfo("ja-JP"), performs a search using

bool found = cultureInfo.CompareInfo.IndexOf(x, y,
    CompareOptions.IgnoreCase | 
    CompareOptions.IgnoreKanaType | 
    CompareOptions.IgnoreWidth
) >= 0;

Since a plain x.IndexOf(y) is much faster, and my x strings are numerous and rarely change, I'd like to canonicalize them once and then perform each search with a simple

canonicalizedX.IndexOf(canonicalize(y));

My question: Is there anything in the .NET libraries that I could use to implement the canonicalize() function, using my CultureInfo and CompareOptions?


Solution 2

I ended up using LCMapStringEx and it works fine for me. It is not driven by (an arbitrary set of) CompareOptions, but the CompareInfo.GetSortKey docs led me to LCMapString, so an IndexOf over the canonicalized strings should yield the same result as CultureInfo.CompareInfo.IndexOf with my hard-coded CompareOptions, here expressed as the map flags dwMapFlags:

public static string Canonicalize(string src)
{
    const string localeName = "ja-JP";
    string result = src;

    uint dwMapFlags = LCMAP_LOWERCASE | LCMAP_HIRAGANA | LCMAP_FULLWIDTH;
    IntPtr pZero = IntPtr.Zero;

    // First call: ask for the required destination size, in characters.
    int cchDest = LCMapStringEx(localeName, dwMapFlags, src, src.Length, IntPtr.Zero, 0, pZero, pZero, pZero);
    if (cchDest > 0)
    {
        // AllocHGlobal takes bytes; cchDest stays in characters for the API call.
        IntPtr ptr = Marshal.AllocHGlobal(cchDest * sizeof(char));
        try
        {
            // Second call: perform the actual mapping into the buffer.
            int written = LCMapStringEx(localeName, dwMapFlags, src, src.Length, ptr, cchDest, pZero, pZero, pZero);
            if (written > 0) result = Marshal.PtrToStringUni(ptr, written);
        }
        finally
        {
            Marshal.FreeHGlobal(ptr);
        }
    }

    return result;
}

[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
static extern int LCMapStringEx(
     string lpLocaleName,
     uint dwMapFlags,
     string lpSrcStr,
     int cchSrc,
     [Out]
     IntPtr lpDestStr,
     int cchDest,
     IntPtr lpVersionInformation,
     IntPtr lpReserved,
     IntPtr sortHandle);

private const uint LCMAP_LOWERCASE = 0x100;
private const uint LCMAP_UPPERCASE = 0x200;
private const uint LCMAP_SORTKEY = 0x400;
private const uint LCMAP_BYTEREV = 0x800;
private const uint LCMAP_HIRAGANA = 0x100000;
private const uint LCMAP_KATAKANA = 0x200000;
private const uint LCMAP_HALFWIDTH = 0x400000;
private const uint LCMAP_FULLWIDTH = 0x800000;

I also tried Microsoft.VisualBasic.StrConv, which works but is about twice as slow as P/Invoking LCMapStringEx.

OTHER TIPS

You are basically asking: "Does .NET give me a way to map katakana to hiragana and full width to half width so I can perform a fast comparison?" To which the answer is a resounding no. You'd have to implement that yourself.

Which is quite difficult. String comparison in .NET is driven by rather extensive character comparison tables. They are, however, optimized for comparison, not for character substitution. You can get some insight into the way the CLR does this by looking at the source code. Download the SSCLI20 distribution and take a look at the clr\src\classlibnative\nls\sortingtable.cpp source code file. The NativeCompareInfo::LongCompareStringW() function does the comparison; you'll see it use the COMPARE_OPTIONS_IGNOREKANATYPE and COMPARE_OPTIONS_IGNOREWIDTH flags. Note how it uses special rules for Kana, taking the "slow path". This function is massive; the odds that you can reverse-engineer a substitution algorithm from it are sufficiently close to zero to give this up quickly. Japanese orthography is complicated.
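That said, a rough first cut at the three specific CompareOptions flags from the question is possible with managed APIs alone: NFKC normalization folds width, invariant lower-casing folds case, and kana type can be folded using the fixed 0x60 offset between the katakana block (U+30A1–U+30F6) and the hiragana block (U+3041–U+3096). The sketch below is my own (the class and method names are invented) and it does not replicate the CLR's full collation tables, so spot-check it against CompareInfo.IndexOf on real data before relying on it:

```csharp
using System;
using System.Globalization;
using System.Linq;

static class KanaFold
{
    // Sketch: approximate IgnoreWidth | IgnoreCase | IgnoreKanaType using
    // only managed APIs. Not equivalent to the CLR's collation tables.
    public static string Fold(string s)
    {
        // NFKC folds full-width Latin to ASCII and half-width katakana
        // to full-width katakana (roughly IgnoreWidth).
        string nfkc = s.Normalize(NormalizationForm.FormKC);

        // Katakana U+30A1..U+30F6 sits exactly 0x60 above the matching
        // hiragana U+3041..U+3096 (roughly IgnoreKanaType).
        char[] folded = nfkc
            .Select(c => c >= '\u30A1' && c <= '\u30F6' ? (char)(c - 0x60) : c)
            .ToArray();

        // Invariant lower-casing (roughly IgnoreCase).
        return new string(folded).ToLowerInvariant();
    }
}
```

With this, Fold(x).IndexOf(Fold(y), StringComparison.Ordinal) approximates the culture-aware search from the question.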

If the strings you compare are stable, then consider storing the comparison result and reusing it.
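That caching idea can be sketched as a small memoizing wrapper (the class and member names here are my own invention): each (x, y) pair pays for the slow culture-aware comparison once, and repeat lookups hit a dictionary instead:

```csharp
using System;
using System.Collections.Concurrent;
using System.Globalization;

// Sketch of the caching suggestion: memoize the culture-aware substring
// test so stable (x, y) pairs are compared only once.
sealed class CachedSearcher
{
    private const CompareOptions Opts =
        CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth;

    private readonly CompareInfo _compare = new CultureInfo("ja-JP").CompareInfo;
    private readonly ConcurrentDictionary<(string X, string Y), bool> _cache = new();

    public bool Contains(string x, string y) =>
        _cache.GetOrAdd((x, y), pair => _compare.IndexOf(pair.X, pair.Y, Opts) >= 0);
}
```

This only helps, of course, if the same pairs actually recur; otherwise the dictionary just grows without ever being hit.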

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow