Question

Further to my previous question: ECMAScript Regex for a multilined string, I have implemented the following loading procedure:

void Load( const std::string& szFileName )
{
     static const std::regex regexObject( "=== ([^=]+) ===\\n((?:.|\\n)*)\\n=== END \\1 ===", std::regex_constants::ECMAScript | std::regex_constants::optimize );
     static const std::regex regexData( "<([^>]+)>:([^<]*)\\n", std::regex_constants::ECMAScript | std::regex_constants::optimize );

     std::ifstream inFile( szFileName );
     inFile.exceptions( std::ifstream::badbit );

     std::string szFileData( (std::istreambuf_iterator<char>(inFile)), (std::istreambuf_iterator<char>()) );

     inFile.close();

     std::vector<std::future<void>> vecFutures;

     for( std::sregex_iterator itObject( szFileData.cbegin(), szFileData.cend(), regexObject ), end; itObject != end; ++itObject )
     {
          if( (*itObject)[1] == "OBJECT1" )
          {
               vecFutures.emplace_back( std::async( []( std::string szDataString ) {
                    for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ) { // Do Stuff }
               }, (*itObject)[2].str() ) );
          }
          else if( (*itObject)[1] == "OBJECT2" )
          {
               vecFutures.emplace_back( std::async( []( std::string szDataString ) {
                    for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ) { // Do Stuff }
               }, (*itObject)[2].str() ) );
          }
     }

     for( auto& future : vecFutures )
     {
          future.get();
     }
}

However, loading it with this file results in a Stack Overflow (parameters: 0x00000001, 0x00332FE4):

=== OBJECT2 ===
<Name>:Test Manufacturer
<Supplier>:Test Supplier
<Address>:Test Multiline
Contact
Address
<Email>:test@test.co.uk
<Telephone Number>:0123456789
=== END OBJECT2 ===
=== OBJECT1 ===
<Number>:1
<Name>:Test
<Location>:Here
<Manufacturer>:
<Model Number>:12345
<Serial Number>:54321
<Owner>:Me
<IP Address>:0.0.0.0
=== END OBJECT1 ===

I have been unable to find the source of the Stack Overflow but it looks like the outer std::sregex_iterator loop is responsible.

Thanks in advance!

Was it helpful?

Solution

Here's another attempt:

=== ([^=]+) ===\n((?:(?!===)[^\n]+\n)+)=== END \1 ===

In your C++ it would obviously be written as:

=== ([^=]+) ===\\n((?:(?!===)[^\\n]+\\n)+)=== END \\1 ===

It's made for minimal backtracking (at least when matching), although I'm a bit Mr. Tired-Face at the moment, so probably missed quite a few ways to improve it.

It makes two assumptions, which are used to avoid a lot of backtracking (that possibly causes the stack overflow, as others have said):

  1. That there's never a === at the start of a line, except for the start/end marker lines.
  2. That C++ supports these regex features - specifically the use of a negative lookahead (?!). It should, considering it's ECMAScript dialect.

Explained:

=== ([^=]+) ===\n

Match and capture the object start marker. The [^=] is one way to avoid a relatively small amount of backtracking here, same as yours - we're not using [^ ], because I do not know if there may be spaces in the OBJECT id.

((?:

Start capturing group for data. Inside it, a non-capturing group, because we're going to match each line individually.

   (?!===)

Negative lookahead - we don't want === at the start of our captured line.

   [^\n]+\n

Matches one line individually.

)+)

Match at least one line between start and end markers, then capture ALL the lines in a single group.

=== END \1 ===

Match the end marker.

Comparison (using RegexBuddy):

Original version:

  • First match: 1277 steps
  • Failed match: 1 step (this is due to the line break between the objects)
  • Second match: 396 steps

Every added object will cause the amount of steps to grow for the previous ones. E.g., adding one more object (copy of object 2, renamed to 3) will result in: 2203 steps, 1322 steps, 425 steps.

This version:

  • First match: 67 steps
  • Failed match: 1 step (once again due to the line break between the objects)
  • Second match: 72 steps
  • Failed match: 1 step
  • Third match: 67 steps

OTHER TIPS

Holy catastrophic backtracking. The culprit is (?:.|\\n)*. Whenever you see a construct like this you know you're asking for trouble.

Why? Because you're telling the engine to match any character (except newline) OR newline, as many times as possible, or none. Let me walk you through it.

The engine will start as expected and match the === OBJECT2 ===-part without any major issues, a newline will be consumed, and hell will then begin. The engine consumes EVERYTHING, all the way down to === END OBJECT1 ===, and backtrack its way from there to a suitable match. Backtracking basically means going back one step and applying the regex again to see if it works. Basically trying all possible permutations with your string. This will, in your case, result in a few hundred thousand attempts. That's probably why stuff is being problematic for you.

I don't know if your code is any better or if it has any errors in it, but (?:.|\\n)* is the same as writing .* with the *s*ingle line modifier (dot matches newlines) or [\S\s]*. If you replace that construct with one of the two I have recommended you will hopefully no longer see a stack overflow error.

Edit: Check out the other solutions too, I did not really have time to go in-depth and provide a solid solution yo your problem besides explaining why its so bad.

Your expressions appear to be causeing a lot of backtracking. I would change your expressions to:

First: ^===\s+(.*?)\s+===[\r\n]+^(.*?)[\r\n]+^===\s+END\s+\1\s+===

Second: ^<([^>]+)>:([^<]*)

Both of these expressions work with the options: Multiline, and DotMatchesAll options. By including the start of line anchor ^ it limits the backtracking to at most one line or one group.

Try with this pattern instead:

static const std::regex regexObject( "=== (\\S+) ===\\n((?:[^\\n]+|\\n(?!=== END \\1 ===))*)\\n=== END \\1 ===", std::regex_constants::ECMAScript | std::regex_constants::optimize );
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top