Correctly accounting for multiple backslashes when tokenizing custom mini-format

https://stackoverflow.com/questions/23550991

18-07-2023

Question

I am writing a small tokenizer in Python for a custom mini-format which looks like this (it can be nested too):

<tag:some_text>

tag is a combination of values from a finite set, and some_text is just text. The delimiters <, : and > can each be escaped with a single \ when they appear as literal characters in the text.

I used the regex r"((\\)?[<:>])" with re.finditer to find the delimiters, and then removed the backslash where necessary by checking token.startswith('\\'). The problem is that the regex gives the wrong result when more than one backslash comes before a delimiter, e.g. "<tag:Some \\\\< text>" -> ['<', 'tag', ':', 'Some \\\\', '<', ' text', '>'].

I cannot find a sensible solution using regexes, so I am considering writing the tokenization in pure Python, with no regex magic (but that may be slow?). Or am I overcomplicating this? Any suggestions?

Solution

Your regular expression will only match the last backslash and delimiter \< in \\\\<.
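
To see this concretely, here is a small check. The names ORIGINAL and sample are mine, not from the question; the pattern is the one you posted.

```python
import re

# The pattern from the question: at most ONE optional backslash before a delimiter.
ORIGINAL = re.compile(r"((\\)?[<:>])")

# Raw string, so this is two literal backslashes in front of the inner "<".
sample = r"<tag:Some \\< text>"

matches = [m.group(0) for m in ORIGINAL.finditer(sample)]
print(matches)  # ['<', ':', '\\<', '>']
```

The third match is a single backslash plus <, so the backslash in front of it is never seen and the run cannot be counted.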

Just add the + quantifier, which means "one or more times", so that the group captures the whole run of backslashes:

((\\+)?[<:>])
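
Capturing the whole run is only half of the job, since you still have to decide whether the delimiter is escaped. Below is a minimal sketch of one way to do that. It assumes that a doubled backslash stands for one literal backslash (your question does not say this), so an odd-length run escapes the delimiter and an even-length run does not. The names DELIM and tokenize are hypothetical, and I wrote (\\*) instead of (\\+)? so the group is an empty string rather than None when there are no backslashes; it matches the same text. Backslashes that are not directly followed by a delimiter are left untouched.

```python
import re

# Zero or more backslashes, then exactly one delimiter.
DELIM = re.compile(r"(\\*)([<:>])")

def tokenize(text):
    """Split text into delimiter tokens and text tokens.

    Assumption: inside a run of backslashes before a delimiter, each pair
    is one literal backslash; a leftover single backslash escapes the delimiter.
    """
    tokens = []
    buf = []   # pieces of the text token currently being built
    pos = 0
    for m in DELIM.finditer(text):
        buf.append(text[pos:m.start()])
        slashes, delim = m.groups()
        # Every pair of backslashes collapses to one literal backslash.
        buf.append("\\" * (len(slashes) // 2))
        if len(slashes) % 2:
            # Odd run: the delimiter is escaped and belongs to the text.
            buf.append(delim)
        else:
            # Even run: a real delimiter; flush the pending text first.
            chunk = "".join(buf)
            if chunk:
                tokens.append(chunk)
            tokens.append(delim)
            buf = []
        pos = m.end()
    # Whatever follows the last delimiter is trailing text.
    buf.append(text[pos:])
    tail = "".join(buf)
    if tail:
        tokens.append(tail)
    return tokens
```

With this sketch, tokenize(r"<tag:a \< b>") keeps the escaped < inside the text token 'a < b', while tokenize(r"<tag:Some \\< text>") treats the inner < as a real delimiter and leaves one literal backslash at the end of the preceding text token. It is still a single regex pass, so speed should not be a reason to drop re here.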

Licensed under: CC-BY-SA with attribution
