I'm trying to parse an XML document using a regex tokenizer in Python (It's a finite set, so regex is just fine!), and I'm having trouble matching comments properly.
The format of these comments is in the form <!--This is a comment--> where the comment itself can contain all sorts of non-alphanumeric characters (including '-')
I want to match them in such a way that I break the comment up into the following tokens:
<!--
This is a comment
-->
The beginning marker is easy enough to get, and I successfully grab that with another regex, but the comment regex itself is getting too greedy and grabbing the -- from the end marker. I want this regex to grab strings that are not necessarily enclosed in a comment as well, though, so it should also be able to take <Tag>This is text</Tag> and correctly return This is text.
This is the regex I am currently using for the text:
[^<>]+(?!-->)
The end result ends up being This is a comment--, when I just want This is a comment so that my other regex can grab the -->. This regex does work for normal tags, however, due to the existence of the '<' on the end tag and properly returns This is text from my previous example.
I know I must not be using negative lookahead properly. Any ideas on what I am doing wrong here? I've tried [^<>]+(?=-->), but then that doesn't match anything that's not a comment of this form (like the normal tags). I figured that the (?!-->) would stop the matching when it sees that pattern, but it doesn't seem to work like that and just continues matching until it sees the ending '>'.
This is the only thing the script is doing at the moment.
Solution
Your problem is that in [^<>]+(?!-->), the lookahead assertion is only tried after This is a comment-- has been matched, and of course it succeeds because there is no --> at the position where the regex engine has ended up.
Therefore, you have to check the lookahead at every position of the string. The proper idiom for this is generally:
(?:(?!-->).)*
or, in your case
(?:(?!-->)[^<>])*
This matches any number of characters except angle brackets until either --> occurs or the end of the string is reached.
OTHER TIPS
You need to use a positive lookahead, to ensure that this pattern is ahead of your match.
[^<>]+(?=-->)
If you want to make it more universal, so that it matches content of other tags too, you can use an alternation inside the lookahead assertion: