I am writing a small tokenizer in Python for a custom mini-format which looks like this (it can be nested too):

<tag:some_text>

tag is a combination of a finite set of values and some_text is just text. The delimiters <, : and > can be escaped by a single \ if they appear in, and as, text.

I used the regex r"((\\)?[<:>])" along with re.finditer to find delimiters and then remove the backslash if necessary by checking with token.startswith('\\'). The problem is that if more backslashes come before the delimiter, the regex is wrong, e.g. "<tag:Some \\\\< text>" -> ['<', 'tag', ':', 'Some \\\\', '<', ' text', '>'].

I cannot find a sensible solution using regexes, and I am considering just writing the tokenization in pure Python, i.e. no regex magic etc (but that may be slow?) or am I overcomplicating this? Any suggestions?

有帮助吗?

解决方案

Your regular expression will only match the last backslash and delimiter \< in \\\\<.

Just add the + quantifier which means (1 or more times)

((\\+)?[<:>])
许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top