Question

I am writing a small tokenizer in Python for a custom mini-format which looks like this (it can be nested too):

<tag:some_text>

tag is a combination of a finite set of values and some_text is just text. The delimiters <, : and > can be escaped by a single \ if they appear in, and as, text.

I used the regex r"((\\)?[<:>])" along with re.finditer to find delimiters and then remove the backslash if necessary by checking with token.startswith('\\'). The problem is that if more backslashes come before the delimiter, the regex is wrong, e.g. "<tag:Some \\\\< text>" -> ['<', 'tag', ':', 'Some \\\\', '<', ' text', '>'].

I cannot find a sensible solution using regexes, and I am considering just writing the tokenization in pure Python, i.e. no regex magic etc (but that may be slow?) or am I overcomplicating this? Any suggestions?

Was it helpful?

Solution

Your regular expression will only match the last backslash and delimiter \< in \\\\<.

Just add the + quantifier which means (1 or more times)

((\\+)?[<:>])
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top