I think you have two problems. One is your whole approach (so skip to the bottom if you just want my real advice), but it looks like the other is catastrophic backtracking.
Why this is breaking
If we simplify your pattern a bit, it boils down to this:
{a}{x*}{x*}{b}
Notice the two x*'s right next to each other? Yes, there's a (?=y) between them, but let's ignore that for a minute, because I don't think the engine is using that efficiently to limit the amount of work it's doing. Suppose you have a string like axxxxxxxb and you want to match it against the pattern. Since there are two x* tokens next to each other, the engine can't tell easily where one group ends and the other begins. So it tries to put them all in the first {x*} bucket, since the * is greedy:
{a}{xxxxxxx}{}{b}
Great, right? It matched, so we can move on. But consider something like axxxxxxxQxb. That doesn't match on the first pass (the b lands on the Q), so the engine has to backtrack and keep trying other ways to split the x's between the two groups:
{a}{xxxxxxx}{}{Q} #nope
{a}{xxxxxx}{x}{Q} #nope
{a}{xxxxx}{xx}{Q} #nope
...
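The number of attempts grows quickly with the length of the run of x's, and the engine repeats the whole exercise from every position where a match could start. Eventually, this takes so long (and backtracks so deep) that it blows up your stack.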
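To get a feel for the numbers, here's a rough Python sketch that walks the same splits by hand for the simplified a x* x* b pattern, anchored at the start of the string. It is not how any real engine is implemented (real engines may optimize some of this away), and the function name count_attempts is made up for the illustration.

```python
def count_attempts(text):
    """Count the (first, second) splits a naive backtracking matcher
    would try for the pattern a x* x* b, starting at position 0.
    Returns (attempts, matched)."""
    if not text.startswith("a"):
        return 0, False

    # Length of the run of x's right after the leading 'a'.
    run = 0
    while 1 + run < len(text) and text[1 + run] == "x":
        run += 1

    attempts = 0
    # First x* is greedy: take everything, then give back one x at a time.
    for first in range(run, -1, -1):
        # Second x* is greedy too: take what is left, then give back.
        for second in range(run - first, -1, -1):
            attempts += 1
            pos = 1 + first + second
            if pos < len(text) and text[pos] == "b":
                return attempts, True
    return attempts, False


print(count_attempts("axxxxxxxb"))    # matches on the first try
print(count_attempts("axxxxxxxQxb"))  # every split has to be tried
```

In this sketch, a run of n x's that never reaches a b costs (n+1)(n+2)/2 attempts, so 36 for a run of 7. That's just for one starting position in one tiny pattern; stack a few of these on a big document and it gets ugly.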
Improving the regex
So how to fix it? Well, there's this:
(?:(?=201[0-3]</pubDate>))
I think the engine will do a better job if it's an affirmative (consuming) token, rather than a lookahead. It doesn't need to be a lookahead anyway; you can just use this (with or without the \s*):
201[0-3]\s*</pubDate>
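If the difference between the two isn't obvious, here's a small Python sketch (Python's re module standing in for whatever engine you're on; the sample string is made up). The lookahead finds the spot but consumes nothing, while the plain token consumes the text so the rest of the pattern carries on after it.

```python
import re

# Made-up sample; note the space before the closing tag.
sample = "<pubDate>Tue, 05 Mar 2013 </pubDate>"

# Lookahead version: asserts the text is there, but the match is zero-width.
lookahead = re.search(r"(?=201[0-3]\s*</pubDate>)", sample)

# Consuming version: same check, but the engine moves past the text.
consuming = re.search(r"201[0-3]\s*</pubDate>", sample)

print(lookahead.start() == lookahead.end())  # True: nothing was consumed
print(repr(consuming.group()))               # the year, the space and the tag
```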
The (?:(?!</item>).)* after that is redundant; a lazy .*? should be all you need there.
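Also, you can turn on the option that makes . match newlines too. That's Singleline (the s flag) in most flavors, including RegexOptions.Singleline in .NET; Ruby calls it /m. Careful: in .NET, Multiline only changes what ^ and $ match, it doesn't touch the dot. I'm not sure whether that will make any difference in terms of speed/execution. It'll be shorter to write, though.
The whole thing would look something like:
<item>(?:(?!</item>).)*?201[0-3]</pubDate>.*?</item> #plus the Singleline (s) flag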
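Here's a quick way to try that pattern out, sketched in Python rather than your actual environment. re.DOTALL is Python's name for the dot-matches-newline option, and the feed below is invented for the test; it assumes the year sits right before the closing pubDate tag, as the pattern does.

```python
import re

# The pattern from above; re.DOTALL lets . match newlines.
ITEM_2010_2013 = re.compile(
    r"<item>(?:(?!</item>).)*?201[0-3]</pubDate>.*?</item>",
    re.DOTALL,
)

# Made-up sample feed, for illustration only.
feed = """<channel>
<item>
  <title>Old post</title>
  <pubDate>Tue, 05 Mar 2013</pubDate>
</item>
<item>
  <title>New post</title>
  <pubDate>Mon, 02 Feb 2015</pubDate>
</item>
<item>
  <title>Older post</title>
  <pubDate>Fri, 01 Jan 2010</pubDate>
</item>
</channel>"""

# No capturing groups in the pattern, so findall returns whole matches.
matches = ITEM_2010_2013.findall(feed)
titles = [re.search(r"<title>(.*?)</title>", m).group(1) for m in matches]
print(titles)  # ['Old post', 'Older post']
```

The tempered (?:(?!</item>).)*? in the first half is what stops a match from starting in the 2015 item and spilling over into the next one.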
The REAL solution
But I think the real problem is that you're using regex at all. This looks like XML. Why aren't you using an XML parser? If you're using .NET, LINQ to XML is perfect for the exact job you're describing, including the part about specific values in the nested pubDate. It should be way easier and more efficient than a regex.
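To show the shape of the parser approach, here's a minimal sketch using Python's standard xml.etree.ElementTree as a stand-in for LINQ to XML (the idea is the same: walk the item elements and filter on pubDate). The feed is made up, and I'm assuming the year is the last four characters of pubDate; adjust the date handling to whatever your real feed contains.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed, for illustration only.
feed = """<rss><channel>
<item>
  <title>Old post</title>
  <pubDate>Tue, 05 Mar 2013</pubDate>
</item>
<item>
  <title>New post</title>
  <pubDate>Mon, 02 Feb 2015</pubDate>
</item>
<item>
  <title>Older post</title>
  <pubDate>Fri, 01 Jan 2010</pubDate>
</item>
</channel></rss>"""

root = ET.fromstring(feed)

wanted = []
for item in root.iter("item"):
    # Assumption: the year is the last four characters of pubDate.
    pub_date = item.findtext("pubDate", default="").strip()
    year = pub_date[-4:]
    if year.isdigit() and 2010 <= int(year) <= 2013:
        wanted.append(item.findtext("title"))

print(wanted)  # ['Old post', 'Older post']
```

No backtracking to worry about, and a missing or odd pubDate just means the item gets skipped.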