I have been working with the <regex> library (Microsoft Visual Studio 2012: Update 3), trying to use it to implement a slightly safer loading procedure for my application, and have been having a few teething difficulties (cf. Regular Expression causing Stack Overflow, Concurrently using std::regex, defined behaviour? and ECMAScript Regex for a multilined string).

I have got around my initial troubles (incurring a stack overflow, etc.) by using the regex suggested here, and it has been working well; however, if my file is too big, then it causes a stack overflow (which I circumvented by increasing the stack commit and reserve sizes), or if the stack size is large enough not to cause a stack overflow then it results in a std::regex_error with error code 12 (error_stack).

Here is a self-contained example to replicate the issue:

#include <iostream>
#include <string>
#include <regex>

std::string szTest = "=== TEST1 ===\n<Example1>:Test Data\n<Example2>:More Test Data\n<Example3>:Test\nMultiline\nData\n<Example4>:test_email@test.com\n<Example5>:0123456789\n=== END TEST1 ===\n=== TEST2 ===\n<Example1>:Test Data 2\n<Example2>:More Test Data 2\n<Example3>:Test\nMultiline\nData\n2\n<Example4>:test_email2@test.com\n=== END TEST2 ===\n=== TEST3 ===\n<Example1>:Random Test Data\n<Example 2>:More Random Test Data\n<Example 3>:Some\nMultiline\nRandom\nStuff\n=== END TEST3 ===\n\
                      === TEST1 ===\n<Example1>:Test Data (Second)\n<Example2>:Even More Test Data\n<Example3>:0123456431\n=== END TEST1 ===";

int main()
{
    static const std::regex regexObject( "=== ([^=]+) ===\\n((?:.|\\n)*)\\n=== END \\1 ===", std::regex_constants::ECMAScript | std::regex_constants::optimize );

    for( std::sregex_iterator itObject( szTest.cbegin(), szTest.cend(), regexObject ), end; itObject != end; ++itObject )
    {
        std::cout << "Type: " << (*itObject)[1].str() << std::endl;
        std::cout << "Data: " << (*itObject)[2].str() << std::endl;

        std::cout << "-------------------------------------" << std::endl;
    }
}

Building this with the default stack size (4kB commit and 1MB reserve) and running it results in a stack overflow exception; with a larger stack (8kB commit and 2MB reserve), it instead throws a std::regex_error with error code 12 (error_stack).

Is there anything I can do to prevent these errors, or is it simply that the regex library was designed to be used only with small strings (i.e. DoB checking etc.)?

Thanks in advance!


Solution

The problem is the back reference (\1). Back references are evil, or at least very difficult to implement in the general case, and it's not easy to recognize not-general cases.

In your case, the problem is that the regex's first match will be from the first === TEST1 === to the last === END TEST1 ===. That's not what you intended, but it is the way regexes work: the match starts at the leftmost possible position, and the * quantifier is greedy, so it takes as much as it can. In theory, it's still possible to match the regex without killing the stack, but I doubt whether the regex library you're using is clever enough to make that optimization.

You can fix the regex to match what you want it to match by making the data part (((?:.|\\n)*)) non-greedy: change it to ((?:.|\\n)*?). That might also fix the stack blow-up problem, because it will cause the regex to match much earlier, before it blows up the stack. But I don't know if it will work in general; I really don't know anything about the MS implementation.
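As a minimal sketch of that change, the helper below (matchedTypes is a hypothetical name, not something from the question) runs the non-greedy pattern over a string and collects the block names it finds. It only illustrates which blocks get matched on a small input; it makes no claim about stack usage on large files.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the block name of every match produced by
// the non-greedy variant of the pattern from the question.
std::vector<std::string> matchedTypes(const std::string& text)
{
    // Same pattern as in the question, except that the data part uses *? instead of *.
    static const std::regex re("=== ([^=]+) ===\\n((?:.|\\n)*?)\\n=== END \\1 ===");

    std::vector<std::string> types;
    for (std::sregex_iterator it(text.cbegin(), text.cend(), re), end; it != end; ++it)
    {
        types.push_back((*it)[1].str());
    }
    return types;
}
```

With the non-greedy quantifier, each block ends at the nearest matching END line, so two separate blocks with the same name come out as two matches instead of one.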

In my opinion, you should avoid back references, even though it means complicating your code a bit. What I would do is to first match:

 === ([^=]+) ===\n

and then create the terminating string:

 "\n=== END " + match[1].str() + " ==="

and then find() the terminating string. That means you can no longer use the regex library's iterator, which is unfortunate, but the loop is still pretty straightforward.
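The two steps could be put together roughly as follows. This is only a sketch: parseSections is a hypothetical name, and skipping a header that has no terminator is an assumption about the desired behaviour, not something the question specifies.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns one (type, data) pair per block.
std::vector<std::pair<std::string, std::string>> parseSections(const std::string& text)
{
    // Step 1 pattern: only the start delimiter, no back reference.
    static const std::regex header("=== ([^=]+) ===\\n");

    std::vector<std::pair<std::string, std::string>> sections;
    std::string::const_iterator searchFrom = text.cbegin();
    std::smatch match;

    while (std::regex_search(searchFrom, text.cend(), match, header))
    {
        // Offset of the first character after the header line.
        const std::size_t dataBegin =
            static_cast<std::size_t>(match[0].second - text.cbegin());

        // Step 2: build the terminating string and look for it with find().
        const std::string terminator = "\n=== END " + match[1].str() + " ===";
        const std::size_t dataEnd = text.find(terminator, dataBegin);

        if (dataEnd == std::string::npos)
        {
            // Assumption: a header without a terminator is skipped.
            searchFrom = match[0].second;
            continue;
        }

        sections.emplace_back(match[1].str(),
                              text.substr(dataBegin, dataEnd - dataBegin));

        // Continue searching after the terminator.
        searchFrom = text.cbegin() + (dataEnd + terminator.size());
    }
    return sections;
}
```

Because find() stops at the first occurrence of the terminator, each block ends at the nearest END line, and the regex only ever has to match a single short header line.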

By the way, I find it odd that you only recognize the start delimiter if it is at the end of a line, and the end delimiter if it is at the start of a line. My inclination would have been to require both of them to be full lines. If you replace the regex-with-back-reference with my two-step approach, it's relatively easy to accomplish that. That might be considered another hint that the regex-with-back-reference is not really the right approach.
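One way to require a delimiter to be a full line, sketched under the assumption that lines are separated by a single '\n' (findFullLine is a hypothetical helper, not part of any library):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: finds `line` at or after offset `from`, but only where
// it occupies a whole line. Returns the offset of its first character, or
// std::string::npos if there is no such occurrence.
std::size_t findFullLine(const std::string& text, const std::string& line, std::size_t from = 0)
{
    for (std::size_t pos = text.find(line, from);
         pos != std::string::npos;
         pos = text.find(line, pos + 1))
    {
        // The occurrence must start at the beginning of the text or right after a newline...
        const bool startsLine = (pos == 0) || (text[pos - 1] == '\n');
        // ...and must be followed by a newline or the end of the text.
        const std::size_t after = pos + line.size();
        const bool endsLine = (after == text.size()) || (text[after] == '\n');

        if (startsLine && endsLine)
        {
            return pos;
        }
    }
    return std::string::npos;
}
```

In the two-step approach, a check like this could stand in for the plain find() on the terminating string, with the delimiter passed without its leading "\n".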

Other tips

Forget <regex> – at least for now, potentially for good. In my opinion, the spec is broken and unusable; but even if it isn’t, at least current implementations are, and probably will be for years to come.

This is because all major vendors implement their own regex engines from scratch instead of relying on existing, tried and tested libraries. This is a huge endeavour.

My recommendation: Use another regex library for now and give <regex> a wide berth. Alternatives are Boost.Regex, Boost.Xpressive and (C-style) libraries such as PCRE or Oniguruma.

Incidentally, we had a discussion about this very topic today in the chat. If you’ve got half an hour, you can read my detailed rant and some interesting counter-points.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow