Pergunta

Answered:

This regex works:

<item>(?:(?!</item>).|\n)*?(?:(?=201[0-3]</pubDate>))(?:(?!</item>).|\n)*?</item>

while this one crashes the stack:

<item>(?:(?!</item>).|\n)*(?:(?=201[0-3]</pubDate>))(?:(?!</item>).|\n)*</item>

This also works, without lookaheads:

(?s)<item>.*?201[0-3]</pubDate>.*?</item>

Original question:

I have an XML file in Sublime Text 2 (example below). I want to find all the <item> elements that contain a <pubDate> element from the years 2010 through 2013.

The above regex works correctly, but when I do find all (the file is about 1MB with about 120 matches) ST2 runs out of stack space.

What horrible inefficiencies lurk above?

Example XML:

<?xml version="1.0" encoding="utf-8"?>
    <channel>
        <item>
            <title>This will match</title>
            <link>http://gcanyon.posterous.com/</link>
            <pubDate>Sat Mar 10 10:22:00 -0800 2012</pubDate>
            <dc:creator><![CDATA[Geoff Canyon]]></dc:creator>
        </item>
        <item>
            <title>This won't</title>
            <link>http://gcanyon.posterous.com/</link>
            <pubDate>Tue Jun 30 05:01:32 -0700 2009</pubDate>
            <dc:creator><![CDATA[Geoff Canyon]]></dc:creator>
        </item>
    </channel>
</rss>
Foi útil?

Solução

Greedy greedy regex. For example:

(?:(?!</item>).|\n)*

will go all the way to the next </item> while it's not what you want, you only don't want it to go farther I'll assume.

You should find happiness in lazy operators.

PS: Sorry I don't have enough time to look more deeply into your regex. Hopefully it'll solve your problem.

Outras dicas

I think you have two problems. One is your whole approach (so skip to the bottom if you just want my real advice), but it looks like the other is catastrophic backtracking.

Why this is breaking

If we simplify your pattern a bit, it boils down to this:

{a}{x*}{x*}{b}

Notice the two x*'s right next to each other? Yes, there's a (?=y) between them, but let's ignore that for a minute, because I don't think the engine is using that efficiently to limit the amount of work it's doing. Suppose you have a string like axxxxxxxb and you want to match it against the pattern. Since there are two x* tokens next to each other, the engine can't tell easily where one group ends and the other begins. So it tries to put them all in the first {x*} bucket, since the * is greedy:

{a}{xxxxxxx}{}{b}

Great, right? It matched, so we can move on. But consider something like axxxxxxQxb. That doesn't match on the first pass, so the engine has to keep trying permutations:

{a}{xxxxxxx}{}{Q} #nope
{a}{xxxxxx}{x}{Q} #nope
{a}{xxxxx}{xx}{Q} #nope
...

Eventually, this takes so long it blows up your stack.

Improving the regex

So how to fix it? Well, there's this:

(?:(?=201[0-3]</pubDate>))

I think the engine will do a better job if it's an affirmative token, rather than a lookahead. It doesn't need to be a lookahead anyway; you can just use this (with or without the \s*):

201[0-3]\s*</pubDate>

The (?:(?!</item>).)* after that is redundant; you should just need a lazy .*?.

Also, you can use the Multiline option to make . also match newlines, but I'm not sure whether that will make any difference in terms of speed/execution. It'll be shorter to write, though.

The whole thing would look something like:

<item>(?:(?!</item>).)*?201[0-3]</pubDate>.*?</item>  #plus the /m flag

The REAL solution

But I think the real problem is that you're using regex at all. This looks like XML. Why aren't you using an XML parser? If you're using .NET, LINQ to XML is perfect for the exact job you're describing, including the part about specific values in the nested pubdate. Should be way easier and more efficient than a regex.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top