Handling Hunspell suggestions with special characters

https://stackoverflow.com/questions/11269586

18-06-2021

Question

I've integrated Hunspell in an unmanaged C++ app on Windows 7 using Visual Studio 2010.

I've got spell checking and suggestions working for English, but now I'm trying to get things working for Spanish and hitting some snags. Whenever I get suggestions for Spanish, the ones containing accented characters are not converted properly to std::wstring objects.

Here is an example of a suggestion that comes back from the Hunspell->suggest method:

[Image: Hunspell suggest(...) result]

Here is the code I'm using to convert that std::string to a std::wstring:

std::wstring StringToWString(const std::string& str)
{
    std::wstring convertedString;
    int requiredSize = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, 0, 0);
    if(requiredSize > 0)
    {
        std::vector<wchar_t> buffer(requiredSize);
        MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, &buffer[0], requiredSize);
        convertedString.assign(buffer.begin(), buffer.end() - 1);
    }

    return convertedString;
}

And after I run that through I get this, with the funky character on the end.

[Image: after conversion to wstring]

Can anyone help me figure out what is going wrong with the conversion here? My guess is that it's related to the negative char values returned from Hunspell, but I don't know how to handle them in the std::wstring conversion code.

Solution 2

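It looks like the output of Hunspell is not UTF-8 but single-byte text in code page 28591 (ISO 8859-1 Latin 1; Western European (ISO)), which I found by looking at the Hunspell default settings for the Unix command-line utility. Accented letters in that encoding are single bytes above 0x7F, so they are not ASCII, and a lone byte such as 0xF3 (ó in ISO 8859-1) is not a valid UTF-8 sequence. That is why converting with CP_UTF8 cannot decode them.

Changing CP_UTF8 to 28591 worked for me.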

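// Updated code page to 28591 (ISO 8859-1) from CP_UTF8
#include <windows.h>
#include <string>
#include <vector>

std::wstring StringToWString(const std::string& str)
{
    std::wstring convertedString;
    // First call: get the required size in wchar_t units, including the terminating null.
    int requiredSize = MultiByteToWideChar(28591, 0, str.c_str(), -1, NULL, 0);
    if(requiredSize > 0)
    {
        std::vector<wchar_t> buffer(requiredSize);
        // Second call: convert into the buffer.
        MultiByteToWideChar(28591, 0, str.c_str(), -1, &buffer[0], requiredSize);
        // Drop the terminating null that the -1 length argument makes the API include.
        convertedString.assign(buffer.begin(), buffer.end() - 1);
    }

    return convertedString;
}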
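
As a portable cross-check of the conversion above: ISO 8859-1 assigns each byte value 0x00 to 0xFF to the Unicode code point with the same number, so the same result can be sketched without the Windows API. The name Latin1ToWString is hypothetical, and the sketch assumes the input really is ISO 8859-1. The cast through unsigned char is relevant to the "negative char" guess in the question: plain char may be signed, so accented bytes can appear as negative values even though the underlying byte is fine.

```cpp
#include <string>

// Hypothetical helper, not part of Hunspell or the Windows API.
// Assumes `str` holds ISO 8859-1 (Latin-1) bytes.
std::wstring Latin1ToWString(const std::string& str)
{
    std::wstring converted;
    converted.reserve(str.size());
    for (std::string::size_type i = 0; i < str.size(); ++i)
    {
        // Go through unsigned char first: where char is signed, a byte such
        // as 0xF1 would otherwise sign-extend instead of becoming U+00F1.
        converted.push_back(
            static_cast<wchar_t>(static_cast<unsigned char>(str[i])));
    }
    return converted;
}
```

If this and StringToWString disagree for a given suggestion, the dictionary is probably not ISO 8859-1 after all.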

Here is a list of code pages from MSDN that helped me find the correct code page integer: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx

OTHER TIPS

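It looks like the output of Hunspell is single-byte text in code page 852. Use 852 instead of CP_UTF8; see the list of code page identifiers at http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx

Or configure Hunspell to return UTF-8, in which case the original CP_UTF8 conversion works unchanged.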
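
Either way, the code page passed to MultiByteToWideChar has to match the encoding the dictionary declares (the SET line in its .aff file, which Hunspell reports through get_dic_encoding()). A minimal sketch of such a lookup follows. The name CodePageForEncoding and the handful of encodings in the table are assumptions for illustration; it returns -1 for anything it does not know, because 0 is already taken by CP_ACP.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: maps an encoding name, as written in a dictionary's
// .aff SET line, to a Windows code page identifier.
// Returns -1 for a null or unrecognized name.
int CodePageForEncoding(const char* encoding)
{
    struct Entry { const char* name; int codePage; };
    static const Entry kTable[] = {
        { "UTF-8",      65001 },  // CP_UTF8
        { "ISO8859-1",  28591 },  // Latin 1, Western European
        { "ISO8859-2",  28592 },  // Latin 2, Central European
        { "ISO8859-15", 28605 }   // Latin 9
    };

    if (encoding == NULL)
        return -1;

    for (std::size_t i = 0; i < sizeof(kTable) / sizeof(kTable[0]); ++i)
    {
        if (std::strcmp(encoding, kTable[i].name) == 0)
            return kTable[i].codePage;
    }
    return -1;
}
```

The returned value could then replace the hard-coded code page in StringToWString, with the caller deciding what to do when the result is -1.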

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow