Question

I want to convert QStrings into filenames. To keep the filenames clean, I want to replace every character that is not a letter or a digit with an underscore. The following code should do that.

#include <iostream>
#include <QString>

QString makeFilename(const QString& title)
{
    QString result;
    for (QString::const_iterator itr = title.begin(); itr != title.end(); ++itr)
        result.push_back(itr->isLetterOrNumber() ? itr->toLower() : '_');
    return result;
}

int main()
{
    QString str = "§";
    std::cout << makeFilename(str).toAscii().data() << std::endl;
}

However, this does not work on my computer. The output I get is:

�_

Looking for an explanation, I found in the debugger that QString("§").size() is 2, whereas QString("a").size() is 1.

My questions:

  • Why does QString use 2 QChars for "§"? (solved)
  • Do you have a solution for makeFilename? Would it also work for Chinese text?

Solution

In addition to what others have said, keep in mind that a QString is a UTF-16 encoded string. A Unicode character outside the Basic Multilingual Plane (BMP), that is, with a code point above U+FFFF, requires two QChar values working together, called a surrogate pair, to encode it. The QString documentation says as much:

Unicode characters with code values above 65535 are stored using surrogate pairs, i.e., two consecutive QChars.

You are not taking that into account when looping through the QString. You are looking at each QChar individually without checking if it belongs to a surrogate pair or not.

Try this instead:

QString makeFilename(const QString& title) 
{ 
    QString result; 

    QString::const_iterator itr = title.begin();
    QString::const_iterator end = title.end();

    while (itr != end)
    {
        if (!itr->isHighSurrogate())
        {
            if (itr->isLetterOrNumber())
            {
                result.push_back(itr->toLower()); 
                ++itr;
                continue;
            }
        }
        else
        {
            ++itr;
            if (itr == end)
                break; // error - missing low surrogate

            if (!itr->isLowSurrogate())
                break; // error - not a low surrogate

            /*
            Letters and digits do exist outside the BMP (for example
            CJK Extension B ideographs). To keep them, decode the pair
            with QChar::surrogateToUcs4() and test its category with
            QChar::category():

            uint ch = QChar::surrogateToUcs4(*(itr-1), *itr);
            QChar::Category cat = QChar::category(ch);
            if (
                ((cat >= QChar::Number_DecimalDigit) && (cat <= QChar::Number_Other)) ||
                ((cat >= QChar::Letter_Uppercase) && (cat <= QChar::Letter_Other))
                )
            {
                // QChar(ch) would truncate ch to 16 bits, so lower-case
                // the code point and append it as UCS-4 instead
                uint lower = QChar::toLower(ch);
                result.append(QString::fromUcs4(&lower, 1));
                ++itr;
                continue;
            }
            */
        }

        result.push_back('_');
        ++itr; 
    }

    return result; 
} 
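
For reference, the arithmetic behind a surrogate pair can be sketched without Qt. This is only an illustration using the standard library: the helper names (isHighSurrogate, isLowSurrogate, combineSurrogates, countCodePoints) are made up for this example and merely mirror what QChar::isHighSurrogate(), QChar::isLowSurrogate() and QChar::surrogateToUcs4() do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if `u` is the first half of a surrogate pair (0xD800..0xDBFF).
bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if `u` is the second half of a surrogate pair (0xDC00..0xDFFF).
bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a valid high/low pair into one code point above 0xFFFF.
char32_t combineSurrogates(char16_t high, char16_t low)
{
    assert(isHighSurrogate(high) && isLowSurrogate(low));
    return 0x10000 + ((char32_t(high) - 0xD800) << 10) + (char32_t(low) - 0xDC00);
}

// Count code points (not UTF-16 code units) in a string.
// An unpaired surrogate is counted as one code point.
std::size_t countCodePoints(const std::u16string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1]))
            ++i; // skip the low half of the pair
        ++n;
    }
    return n;
}
```

Note that "§" is U+00A7, which lies inside the BMP and fits in a single code unit, so countCodePoints would report 1 for a correctly decoded "§".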

OTHER TIPS

Ok, here's my theory: when you feed the "§" literal to a QString, Qt uses some default encoding because you didn't set one. If your compiler uses UTF-8 to store string literals, you might be feeding it 2 bytes which are converted into 2 characters instead of one. Likewise, your "toAscii" output most likely does the wrong thing too.

From the looks of it, you'll have to find out what your compiler uses to store string literals, and call setCodecForCStrings with the correct value.

EDIT: given your description, if I didn't know which encoding my compiler uses, I would probably try QTextCodec::codecForName("UTF-8") as the argument to setCodecForCStrings first :-)
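
To make the two-character theory concrete, here is a small Qt-free sketch. It assumes the compiler stored "§" as the UTF-8 bytes 0xC2 0xA7 and that Qt fell back to Latin-1 when no codec was set; both are assumptions about the asker's setup. The function names are hypothetical, and fromUtf8Bytes is deliberately limited to 1- and 2-byte sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Interpret each byte as one Latin-1 character: every byte becomes its own
// UTF-16 code unit.
std::u16string fromLatin1Bytes(const std::string& bytes)
{
    std::u16string out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<char16_t>(b));
    return out;
}

// Decode 1- and 2-byte UTF-8 sequences only (enough for U+0000..U+07FF).
// Anything else is replaced with U+FFFD; this is not a full UTF-8 decoder.
std::u16string fromUtf8Bytes(const std::string& bytes)
{
    std::u16string out;
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80)
        {
            out.push_back(b0);
        }
        else if ((b0 & 0xE0) == 0xC0 && i + 1 < bytes.size()
                 && (static_cast<unsigned char>(bytes[i + 1]) & 0xC0) == 0x80)
        {
            const unsigned char b1 = static_cast<unsigned char>(bytes[++i]);
            out.push_back(static_cast<char16_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F)));
        }
        else
        {
            out.push_back(static_cast<char16_t>(0xFFFD));
        }
    }
    return out;
}

// Self-check: the two interpretations of the UTF-8 bytes of "§".
void checkSectionSign()
{
    const std::string bytes = "\xC2\xA7";
    assert(fromLatin1Bytes(bytes).size() == 2); // "Â§": two characters
    assert(fromUtf8Bytes(bytes).size() == 1);   // "§": one character
    assert(fromUtf8Bytes(bytes)[0] == 0x00A7);
}
```

Under those assumptions, the Latin-1 reading yields "Â§". "Â" is a letter and "§" is not, which would explain why makeFilename produced one (mis-displayed) letter followed by an underscore.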

Licensed under: CC-BY-SA with attribution