Question

NOTE: When I say the regex [\0] I mean the regex [\0] (not contained in a C-style string, which would then be "[\\0]"). If I haven't put quotes around it, it's not a C-style string, and the backslashes shouldn't be interpreted as escaping a C-style string.

Inspired by this question and my investigation, I tried the following code in clang 3.4:

#include <regex>
#include <string>

int main()
{
    std::string input = "foobar";
    std::regex regex("[^\\0]*"); // Note, this is "\\0", not "\0"!

    return std::regex_match(input, regex);
}

Apparently, clang's standard library (libc++, judging by the std::__1 namespace in the message) doesn't like this: at run time it throws:

std::__1::regex_error: The expression contained an invalid escaped character, or a trailing escape.

The culprit seems to be the [^\0] part, which the engine apparently treats as an invalid escape (changing it to [^\n] or something similar works fine). To be clear, I'm not talking about the '\0' character (null character) or the '\n' character (newline character). In C-style strings, what I'm talking about is "\\0" (a string containing backslash zero) and "\\n" (a string containing backslash n). The regex engine seems to interpret "\\n" as a newline, but it chokes on "\\0".
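To make the string-literal distinction concrete, here is a minimal sketch. The three function names are invented for illustration; they only report how many characters of pattern text actually reach the regex engine for each way of writing the literal.

```cpp
#include <cstddef>
#include <string>

// "[^\\0]*": the compiler collapses \\ into one backslash, so the engine
// receives the six characters  [ ^ \ 0 ] *
std::size_t escaped_pattern_length()
{
    return std::string("[^\\0]*").size();
}

// "[^\0]*": the compiler turns \0 into a NUL byte, and constructing a
// std::string from a const char* stops at that byte, leaving only "[^".
std::size_t unescaped_pattern_length()
{
    return std::string("[^\0]*").size();
}

// A C++11 raw string literal passes the backslash through untouched,
// so this is the same six-character pattern as the escaped form.
std::size_t raw_pattern_length()
{
    return std::string(R"([^\0]*)").size();
}
```

The first and third forms hand the regex engine a backslash followed by the digit 0, which is the case this question is about; the second form never gets that far.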

The C++11 standard says in section 28.13 [re.grammar] that:

The regular expression grammar recognized by basic_regex objects constructed with the ECMAScript flag is that specified by ECMA-262, except as specified below.

I'm no expert on ECMA-262, but I tried the regular expression on JSFiddle and it's working fine there in JavaScript land.

So now I'm wondering if the regex [^\0] is valid in ECMA-262 and the C++11 standard removed support for it (in the stuff following ... except as specified below.).

Question: Is the \0 (not the null-character; in a string literal this would be "\\0") escape sequence legal in a C++11 regular expression? Is it legal in ECMA-262 (or are browser JS VMs just being "too" lenient)? What's the cause/justification for the different behaviors?


Solution

This was a bug in libc++'s implementation of <regex>. It should be fixed now in the trunk, and this should propagate to OS X's release code eventually.

Also, here is the excerpt from the ECMA-262 standard that is the basis for this bug report:

15.10.2.11 DecimalEscape

The production DecimalEscape :: DecimalIntegerLiteral [lookahead ∉ DecimalDigit] evaluates as follows:

  1. Let i be the MV of DecimalIntegerLiteral.
  2. If i is zero, return the EscapeValue consisting of a <NUL> character (Unicode value 0000).
  3. Return the EscapeValue consisting of the integer i.

Note: ... \0 represents the <NUL> character and cannot be followed by a decimal digit.
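If you want to check how a particular standard library treats this escape, a small probe along the following lines may help. It is only a sketch: Outcome and probe are names made up here, not part of <regex>, and the result depends on the library version you build against.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this sketch (not a standard library type).
enum class Outcome { Matched, NotMatched, Rejected };

// Hypothetical helper: compiles `pattern` with the default ECMAScript
// grammar and reports whether it matches all of `input`, or whether the
// library refused to compile the pattern at all.
Outcome probe(const std::string& pattern, const std::string& input)
{
    try {
        const std::regex re(pattern);
        return std::regex_match(input, re) ? Outcome::Matched
                                           : Outcome::NotMatched;
    } catch (const std::regex_error&) {
        // An affected libc++ ends up here for patterns containing \0.
        return Outcome::Rejected;
    }
}
```

Going by the DecimalEscape rule quoted above, probe("[^\\0]*", "foobar") should report Matched on a conforming implementation, and the same pattern applied to an input that contains a NUL character, such as std::string("foo\0bar", 7), should report NotMatched. A library with the bug described here would report Rejected for both.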

Licensed under: CC-BY-SA with attribution