The numbers that you give in your pattern are UTF-16 code unit values, which are not the same thing as code points. See the Unicode Glossary for reference. If you paste "HÈńr" into unicodelookup.com, you'll notice that 'ń' has Unicode code point 0x144, which is bigger than the 0xFF that you've specified as the upper end of the acceptable range.
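To see which code unit falls outside a given upper bound, here is a minimal sketch in plain standard C++ (no Qt). The helper name firstUnitAbove is hypothetical and exists only for this illustration; it assumes the text is already available as UTF-16 code units.

```cpp
#include <cstddef>
#include <string>

// Return the index of the first UTF-16 code unit whose value is above
// `upper`, or std::u16string::npos if every code unit fits in the range.
// firstUnitAbove is a hypothetical helper, not part of Qt or the standard library.
std::size_t firstUnitAbove(const std::u16string &text, char16_t upper)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] > upper)
            return i;   // this code unit is outside [0, upper]
    }
    return std::u16string::npos;
}
```

With the string written as u"H\u00C8\u0144r", a bound of 0xFF stops at index 2 (the 'ń', 0x144), while a bound of 0xFFFF lets every code unit through.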
To accept all Unicode characters, you need the following expression:
[\x0-\xFFFF]*
To accept only the first plane characters, those where one code unit (QChar) always corresponds to one code point, you need the following expression:
[\x0-\xD7FF\xE000-\xFFFF]*
The name byteArray that you gave to your regular expression is outright deceptive: a QString is not an array of bytes, nor an array of Unicode code points, but an array of UTF-16 code units.
The code points in the first plane (U+0000 to U+D7FF and U+E000 to U+FFFF) are represented in UTF-16 as a single code unit. A QChar is always a code unit. Code points from the other, supplementary planes are represented as two QChar code units: a surrogate pair.
Dealing with such pairs complicates matters. Suppose you wanted to match on '𐎘', code point 0x10398. This is represented as two code units in UTF-16: 0xD800 0xDF98. The pattern would be:
([\xD800][\xDF98])
#include <QString>
#include <QDebug>
#include <QRegExp>

int main()
{
    // One supplementary-plane code point, zero-terminated.
    uint data[] = { 0x10398, 0 };
    QString s = QString::fromUcs4(data);
    QRegExp r("^([\\xD800][\\xDF98])$");
    qDebug() << s.size() << s.contains(r);
}
The output is:
2 true
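The surrogate values 0xD800 and 0xDF98 follow from the UTF-16 encoding rule: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. A minimal sketch in plain standard C++ follows; toSurrogatePair is a hypothetical helper name, and the sketch assumes the input is a valid supplementary code point (U+10000 to U+10FFFF). QChar also offers static highSurrogate() and lowSurrogate() functions for the same purpose.

```cpp
#include <cstdint>
#include <utility>

// Split a supplementary-plane code point (U+10000..U+10FFFF) into its
// UTF-16 high and low surrogate code units.
// toSurrogatePair is a hypothetical helper, not a Qt function; it does not
// validate its input.
std::pair<std::uint16_t, std::uint16_t> toSurrogatePair(std::uint32_t codePoint)
{
    const std::uint32_t offset = codePoint - 0x10000;   // 20 significant bits
    const std::uint16_t high =
        static_cast<std::uint16_t>(0xD800 + (offset >> 10));    // top 10 bits
    const std::uint16_t low =
        static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
    return { high, low };
}
```

For 0x10398 the offset is 0x398, so the high surrogate is 0xD800 + 0 and the low surrogate is 0xDC00 + 0x398 = 0xDF98.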
If you wanted to match certain ranges only, say the first plane and the storied Linear B Syllabary range, U+10000 to U+1007F, you'd use the pattern:
([\x0-\xD7FF\xE000-\xFFFF]|([\xD800][\xDC00-\xDC7F]))*
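#include <QString>
#include <QDebug>
#include <QRegExp>

int main()
{
    // Two first-plane code points and both ends of the Linear B Syllabary range.
    uint data[] = { 0x30, 0x40, 0x10000, 0x1007F, 0 };
    QString s = QString::fromUcs4(data);
    QRegExp r("^([\\x0-\\xD7FF\\xE000-\\xFFFF]|([\\xD800][\\xDC00-\\xDC7F]))+$");
    qDebug() << s.size() << s.contains(r);
}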
The output is:
6 true
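The same acceptance rule can also be written as an explicit loop over code units, which may make it easier to see what the pattern does. This is only a sketch in plain standard C++ under the assumption that the text is held as UTF-16 in a std::u16string; isBmpOrLinearBSyllabary is a hypothetical name. Like the pattern with the * quantifier, it accepts an empty string.

```cpp
#include <cstddef>
#include <string>

// Accept text made only of first-plane code points (single non-surrogate
// code units) and Linear B Syllabary code points (0xD800 followed by
// 0xDC00..0xDC7F). Hypothetical helper mirroring the pattern
// ([\x0-\xD7FF\xE000-\xFFFF]|([\xD800][\xDC00-\xDC7F]))*
bool isBmpOrLinearBSyllabary(const std::u16string &text)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t unit = text[i];
        if (unit < 0xD800 || unit > 0xDFFF)
            continue;   // one code unit, one first-plane code point
        // A surrogate: only 0xD800 followed by 0xDC00..0xDC7F is allowed.
        if (unit == 0xD800 && i + 1 < text.size()
            && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDC7F) {
            ++i;        // consume the low surrogate as well
            continue;
        }
        return false;   // any other surrogate, paired or lone, is rejected
    }
    return true;
}
```

The string u"0@\U00010000\U0001007F" has six code units and is accepted, whereas u"\U00010398" (0xD800 0xDF98) is rejected because its low surrogate lies outside 0xDC00 to 0xDC7F.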