Regex: Match a string up to a pattern (Which may not exist) [duplicate]

https://stackoverflow.com/questions/17088712

31-05-2022
|

Question

I'm trying to parse an XML document using a regex tokenizer in Python (It's a finite set, so regex is just fine!), and I'm having trouble matching comments properly.

The format of these comments is in the form  where the comment itself can contain all sorts of non-alphanumeric characters (including '-')

I want to match them in such a way that I break the comment up into the following tokens:

<!--

This is a comment

-->

The beginning marker is easy enough to get, and I successfully grab that with another regex, but the comment regex itself is getting too greedy and grabbing the -- from the end marker. I want this regex to grab strings that are not necessarily enclosed in a comment as well, though, so it should also be able to take <Tag>This is text</Tag> and correctly return This is text.

This is the regex I am currently using for the text:

[^<>]+(?!-->)

The end result ends up being This is a comment--, when I just want This is a comment so that my other regex can grab the -->. This regex does work for normal tags, however, due to the existence of the '<' on the end tag and properly returns This is text from my previous example.

I know I must not be using negative lookahead properly. Any ideas on what I am doing wrong here? I've tried [^<>]+(?=-->), but then that doesn't match anything that's not a comment of this form (like the normal tags). I figured that the (?!-->) would stop the matching when it sees that pattern, but it doesn't seem to work like that and just continues matching until it sees the ending '>'.

Posting a segment of code for context:

xml_scanner = re.Scanner([
    (r"  ",             lambda scanner,token:("INDENT", token)),
    (r"<[A-Za-z\d._]+(?!\/)>",  lambda scanner,token:("BEGINTAG", token)),
    (r"<\/[A-Za-z\d._]+(?!\/)>",  lambda scanner,token:("ENDTAG", token)),
    (r"<[A-Za-z\d._]+\/>",      lambda scanner,token:("INLINETAG", token)),
    (r"<!--",               lambda scanner,token:("BEGINCOMMENT", token)),
    (r"-->",                lambda scanner,token:("ENDCOMMENT", token)),
    (r"[^<>]+(?!-->)",         lambda scanner,token:("DATA", token)),
    (r"\r$", None),
])

for line in database_file:
    results, remainder = xml_scanner.scan(line)

This is the only thing the script is doing at the moment.

Solution

Your problem is that in [^<>]+(?!-->), the lookahead assertion is only tried after This is a comment-- has been matched, and of course it succeeds because there is no --> at the position where the regex engine has ended up.

Therefore, you have to check the lookahead at every position of the string. The proper idiom for this is generally:

(?:(?!-->).)*

or, in your case

(?:(?!-->)[^<>])*

This matches any number of characters except angle brackets until either --> occurs or the end of the string is reached.

OTHER TIPS

You need to use a positive lookahead, to ensure that this pattern is ahead of your match.

[^<>]+(?=-->)

If you want to make it more universal, so that it matches content of other tags too, you can use an alternation inside the lookahead assertion:

[^<>]+(?=-->|</)

See it here on Regexr

This regex will stop, if there is "-->" or "</" ahead.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow