Question

I have a QRegExp with the following pattern:

QRegExp byteArray;
byteArray.setPattern("[\\x00-\\xff]*");

This pattern is used to validate QStrings. Can someone provide an example of the kinds of QString that would fail this test? I have a bug in which a QString arrives that doesn't match the pattern.

Can this pattern match any Unicode string?

Example of a QString that fails validation against the pattern: HÈńr

Why?

Solution

QString uses UTF-16 internally, not UTF-8.

You also need to start the range at \x0001 for QRegExp.

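#include <QString>
#include <QRegExp>
#include <QDebug>

int main()
{
        // A single code point outside the first plane; QString stores it as two code units.
        uint data[] = { 0x10c436, 0 };
        QString s = QString::fromUcs4(data);
        // The range covers every UTF-16 code unit except 0.
        QRegExp r("^[\\x0001-\\xffff]+$");
        qDebug() << s.size() << s.contains(r);
}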

This results in a match:

2 true

NOTE: If you are using QRegularExpression, the above will no longer match. QRegularExpression uses PCRE in UTF-16 mode, so there must be some extra checking in the PCRE code, although it reports no errors. I haven't looked further into it.

Also, QRegularExpression accepts \x0000, but QRegExp does not.

The moral of the story: don't try to match binary data with a regular expression.

OTHER TIPS

The numbers that you give in your pattern are UTF-16 code unit values (which are different from code points). See the Unicode Glossary for reference. If you paste "HÈńr" into unicodelookup.com, you'll notice that 'ń' has Unicode code point 0x144, which is bigger than the 0xFF that you've specified as the upper end of the acceptable range.
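As a rough illustration of the point above, the check can be reproduced without a regular expression. This is only a sketch in standard C++ rather than Qt: std::u16string stands in for QString, and firstUnitAbove and sample are hypothetical names introduced here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: index of the first UTF-16 code unit greater than
// maxUnit, or std::u16string::npos if every code unit is within range.
std::size_t firstUnitAbove(const std::u16string& s, char16_t maxUnit)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] > maxUnit)
            return i;
    }
    return std::u16string::npos;
}

// The string from the question: 'H' (0x48), 'È' (0xC8), 'ń' (0x144), 'r' (0x72).
const std::u16string sample = u"H\u00C8\u0144r";
```

Calling firstUnitAbove(sample, 0xFF) returns 2, the position of 'ń', while firstUnitAbove(sample, 0xFFFF) returns npos because no 16-bit code unit can exceed 0xFFFF.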

To accept all Unicode characters, you need the following expression:

[\x0-\xFFFF]*

To accept only characters in the first plane (the Basic Multilingual Plane), where one code unit (QChar) always corresponds to one code point, you need the following expression:

[\x0-\xD7FF\xE000-\xFFFF]*

The name byteArray that you gave to your regular expression is outright deceptive: a QString is not an array of bytes, not an array of Unicode code points, but an array of UTF-16 code units.

The code points in the first plane (U+0000 to U+D7FF and U+E000 to U+FFFF) are represented in UTF-16 as a single code unit. QChar is always a code unit. Code points from the other, supplementary planes are represented as two QChar code units: a surrogate pair.

Dealing with such pairs complicates matters. Suppose you wanted to match on '𐎘', code point 0x10398. This is represented as two code units in UTF-16: 0xD800 0xDF98. The pattern would be:

([\xD800][\xDF98])

#include <QString>
#include <QDebug>
#include <QRegExp>
int main()
{
   uint data[] = { 0x10398, 0 };
   QString s = QString::fromUcs4(data);
   QRegExp r("^([\\xD800][\\xDF98])$");
   qDebug() << s.size() << s.contains(r);
}

The output is:

2 true

If you wanted to match certain ranges only, say the first plane and the Linear B Syllabary range (U+10000 to U+1007F), you'd use the pattern:

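([\x0-\xD7FF\xE000-\xFFFF]|([\xD800][\xDC00-\xDC7F]))*

#include <QString>
#include <QDebug>
#include <QRegExp>
int main()
{
   // Two first-plane code points, then the first and last code points of the Linear B Syllabary range.
   uint data[] = { 0x30, 0x40, 0x10000, 0x1007F, 0 };
   QString s = QString::fromUcs4(data);
   QRegExp r("^([\\x0-\\xD7FF\\xE000-\\xFFFF]|([\\xD800][\\xDC00-\\xDC7F]))+$");
   qDebug() << s.size() << s.contains(r);
}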

The output is:

6 true
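
The same range check can also be written as a plain loop over code units, which avoids the regular expression altogether. This is only a sketch in standard C++: std::u16string stands in for QString, and isBmpOrLinearB is a hypothetical name. Like the pattern with *, it accepts an empty string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if every character in s is either a first-plane
// code point (a single non-surrogate code unit) or lies in U+10000..U+1007F.
bool isBmpOrLinearB(const std::u16string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const char16_t u = s[i];
        if (u < 0xD800 || u > 0xDFFF) {
            ++i;                              // ordinary first-plane code unit
        } else if (u == 0xD800 && i + 1 < s.size()
                   && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDC7F) {
            i += 2;                           // surrogate pair for U+10000..U+1007F
        } else {
            return false;                     // lone surrogate or another plane
        }
    }
    return true;
}
```

The six code units from the example above (0x30, 0x40, 0xD800 0xDC00, 0xD800 0xDC7F) pass this check, while the pair 0xD800 0xDF98 for U+10398 does not.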
Licensed under: CC-BY-SA with attribution